mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-06 15:46:57 +01:00

Merge pull request #55724 from openshift-cherrypick-robot/cherry-pick-53509-to-enterprise-4.13

Alex Dellapenta
2023-02-08 12:44:27 -07:00
committed by GitHub
10 changed files with 227 additions and 141 deletions

@@ -0,0 +1,23 @@
// Module included in the following assemblies:
//
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
:_content-type: CONCEPT
[id="about-using-nvidia-gpu_{context}"]
= About using the NVIDIA GPU Operator
The NVIDIA GPU Operator manages NVIDIA GPU resources in an {product-title} cluster and automates tasks related to bootstrapping GPU nodes.
Because the GPU is a special resource in the cluster, you must install some components before you deploy application workloads onto the GPU.
These components include the NVIDIA drivers, which enable Compute Unified Device Architecture (CUDA); the Kubernetes device plugin; the container runtime; and other features, such as automatic node labeling and monitoring.
[NOTE]
====
The NVIDIA GPU Operator is supported only by NVIDIA. For more information about obtaining support from NVIDIA, see link:https://access.redhat.com/solutions/5174941[Obtaining Support from NVIDIA].
====
There are two ways to enable GPUs with {product-title} {VirtProductName}: the {product-title}-native way described here, and the NVIDIA GPU Operator.
The NVIDIA GPU Operator is a Kubernetes Operator that enables {product-title} {VirtProductName} to expose GPUs to virtualized workloads running on {product-title}.
It lets users provision and manage GPU-enabled virtual machines, so that they can run complex artificial intelligence/machine learning (AI/ML) workloads on the same platform as their other workloads.
It also provides a way to scale the GPU capacity of the infrastructure to support rapid growth of GPU-based workloads.
For more information about using the NVIDIA GPU Operator to provision worker nodes for running GPU-accelerated VMs, see link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/openshift-virtualization.html[NVIDIA GPU Operator with OpenShift Virtualization].

@@ -0,0 +1,9 @@
// Module included in the following assemblies:
//
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
:_content-type: CONCEPT
[id="virt-using-mediated-devices_{context}"]
= Using mediated devices
A vGPU is a type of mediated device; the performance of the physical GPU is divided among the virtual devices. You can assign mediated devices to one or more virtual machines.

@@ -0,0 +1,23 @@
// Module included in the following assemblies:
//
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
:_content-type: CONCEPT
[id="about-changing-removing-mediated-devices_{context}"]
= About changing and removing mediated devices
The cluster's mediated device configuration can be updated with {VirtProductName} by:
* Editing the `HyperConverged` CR and changing the contents of the `mediatedDevicesTypes` stanza.
* Changing the node labels that match the `nodeMediatedDeviceTypes` node selector.
* Removing the device information from the `spec.mediatedDevicesConfiguration` and `spec.permittedHostDevices` stanzas of the `HyperConverged` CR.
+
[NOTE]
====
If you remove the device information from the `spec.permittedHostDevices` stanza without also removing it from the `spec.mediatedDevicesConfiguration` stanza, you cannot create a new mediated device type on the same node. To properly remove mediated devices, remove the device information from both stanzas.
====
Depending on the specific changes, these actions cause {VirtProductName} to reconfigure mediated devices or remove them from the cluster nodes.
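The note above says that a device must be removed from both stanzas. The following Python sketch illustrates that rule on an in-memory dictionary that mirrors the `spec` of the `HyperConverged` CR. It is an illustration only: the function name `remove_mediated_device` is hypothetical, the sketch does not contact a cluster, and it assumes the field layout shown in the configuration overview (`mediatedDevicesTypes`, `nodeMediatedDeviceTypes`, and `mediatedDevices` entries with an `mdevNameSelector`).

```python
import copy


def remove_mediated_device(spec, device_type, name_selector):
    """Return a copy of a HyperConverged ``spec`` dict with one device
    removed from BOTH stanzas (hypothetical helper, not a product API)."""
    new_spec = copy.deepcopy(spec)  # leave the caller's dict untouched

    # Stanza 1: spec.mediatedDevicesConfiguration
    mdev_config = new_spec.get("mediatedDevicesConfiguration", {})
    if "mediatedDevicesTypes" in mdev_config:
        mdev_config["mediatedDevicesTypes"] = [
            t for t in mdev_config["mediatedDevicesTypes"] if t != device_type
        ]
    # Per-node overrides can list the same type, so clean them as well.
    for override in mdev_config.get("nodeMediatedDeviceTypes", []):
        override["mediatedDevicesTypes"] = [
            t for t in override.get("mediatedDevicesTypes", []) if t != device_type
        ]

    # Stanza 2: spec.permittedHostDevices
    permitted = new_spec.get("permittedHostDevices", {})
    if "mediatedDevices" in permitted:
        permitted["mediatedDevices"] = [
            d for d in permitted["mediatedDevices"]
            if d.get("mdevNameSelector") != name_selector
        ]
    return new_spec
```

Returning a modified copy, rather than editing in place, makes it easy to compare the old and new configuration before applying any change to the CR.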

@@ -15,130 +15,3 @@ Refer to your hardware vendor's documentation for functionality and support deta
Mediated device:: A physical device that is divided into one or more virtual devices. A vGPU is a type of mediated device (mdev); the performance of the physical GPU is divided among the virtual devices. You can assign mediated devices to one or more virtual machines (VMs), but the number of guests must be compatible with your GPU. Some GPUs do not support multiple guests.
[id="configuration-overview_{context}"]
== Configuration overview
When configuring mediated devices, an administrator must:
* Create the mediated devices.
* Expose the mediated devices to the cluster.
The `HyperConverged` CR includes APIs that accomplish both tasks:
.Creating mediated devices
[source,yaml]
----
...
spec:
  mediatedDevicesConfiguration:
    mediatedDevicesTypes: <.>
    - <device_type>
    nodeMediatedDeviceTypes: <.>
    - mediatedDevicesTypes: <.>
      - <device_type>
      nodeSelector: <.>
        <node_selector_key>: <node_selector_value>
...
----
<.> Required: Configures global settings for the cluster.
<.> Optional: Overrides the global configuration for a specific node or group of nodes. Must be used with the global `mediatedDevicesTypes` configuration.
<.> Required if you use `nodeMediatedDeviceTypes`. Overrides the global `mediatedDevicesTypes` configuration for select nodes.
<.> Required if you use `nodeMediatedDeviceTypes`. Must include a `key:value` pair.
.Exposing mediated devices to the cluster
[source,yaml]
----
...
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: GRID T4-2Q <.>
      resourceName: nvidia.com/GRID_T4-2Q
...
----
<.> Exposes the mediated devices that map to this value on the host.
+
[NOTE]
====
You can see the mediated device types that your device supports by viewing the contents of `/sys/bus/pci/devices/<domain>:<bus>:<slot>.<function>/mdev_supported_types/<type>/name`, substituting the correct values for your system.
For example, the name file for the `nvidia-231` type contains the selector string `GRID T4-2Q`. Using `GRID T4-2Q` as the `mdevNameSelector` value allows nodes to use the `nvidia-231` type.
====
[id="how-vgpus-are-assigned-to-nodes_{context}"]
== How vGPUs are assigned to nodes
For each physical device, {VirtProductName} configures:
* A single mdev type.
* The maximum number of instances of the selected mdev type.
The cluster architecture affects how devices are created and assigned to nodes.
Large cluster with multiple cards per node:: On nodes with multiple cards that can support similar vGPU types, the relevant device types are created in a round-robin manner.
For example:
+
[source,yaml]
----
...
mediatedDevicesConfiguration:
  mediatedDevicesTypes:
  - nvidia-222
  - nvidia-228
  - nvidia-105
  - nvidia-108
...
----
+
In this scenario, each node has two cards, both of which support the following vGPU types:
+
[source,text]
----
nvidia-105
...
nvidia-108
nvidia-217
nvidia-299
...
----
+
On each node, {VirtProductName} creates:
* 16 vGPUs of type nvidia-105 on the first card.
* 2 vGPUs of type nvidia-108 on the second card.
One node has a single card that supports more than one requested vGPU type:: {VirtProductName} uses the supported type that comes first on the `mediatedDevicesTypes` list.
+
For example, a node's card supports `nvidia-223` and `nvidia-224`. The following `mediatedDevicesTypes` list is configured:
+
[source,yaml]
----
...
mediatedDevicesConfiguration:
  mediatedDevicesTypes:
  - nvidia-22
  - nvidia-223
  - nvidia-224
...
----
+
In this example, {VirtProductName} uses the `nvidia-223` type.
[id="about-changing-removing-mediated-devices_{context}"]
== About changing and removing mediated devices
{VirtProductName} updates the cluster's mediated device configuration if:
* You edit the `HyperConverged` CR and change the contents of the `mediatedDevicesTypes` stanza.
* You change the node labels that match the `nodeMediatedDeviceTypes` node selector.
* You remove the device information from the `spec.mediatedDevicesConfiguration` and `spec.permittedHostDevices` stanzas of the `HyperConverged` CR.
+
[NOTE]
====
If you remove the device information from the `spec.permittedHostDevices` stanza without also removing it from the `spec.mediatedDevicesConfiguration` stanza, you cannot create a new mediated device type on the same node. To properly remove mediated devices, remove the device information from both stanzas.
====
Depending on the specific changes, these actions cause {VirtProductName} to reconfigure mediated devices or remove them from the cluster nodes.

@@ -0,0 +1,9 @@
// Module included in the following assemblies:
//
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
:_content-type: CONCEPT
[id="virt-adding-and-removing-mediated-devices_{context}"]
= Adding and removing mediated devices
You can add or remove mediated devices.

@@ -0,0 +1,63 @@
// Module included in the following assemblies:
//
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
:_content-type: REFERENCE
[id="how-vgpus-are-assigned-to-nodes_{context}"]
= How vGPUs are assigned to nodes
For each physical device, {VirtProductName} configures the following values:
* A single `mdev` type.
* The maximum number of instances of the selected `mdev` type.
The cluster architecture affects how devices are created and assigned to nodes.
Large cluster with multiple cards per node:: On nodes with multiple cards that can support similar vGPU types, the relevant device types are created in a round-robin manner.
For example:
+
[source,yaml]
----
...
mediatedDevicesConfiguration:
  mediatedDevicesTypes:
  - nvidia-222
  - nvidia-228
  - nvidia-105
  - nvidia-108
...
----
+
In this scenario, each node has two cards, both of which support the following vGPU types:
+
[source,text]
----
nvidia-105
...
nvidia-108
nvidia-217
nvidia-299
...
----
+
On each node, {VirtProductName} creates the following vGPUs:
* 16 vGPUs of type nvidia-105 on the first card.
* 2 vGPUs of type nvidia-108 on the second card.
One node has a single card that supports more than one requested vGPU type:: {VirtProductName} uses the supported type that comes first on the `mediatedDevicesTypes` list.
+
For example, the card on a node supports `nvidia-223` and `nvidia-224`. The following `mediatedDevicesTypes` list is configured:
+
[source,yaml]
----
...
mediatedDevicesConfiguration:
  mediatedDevicesTypes:
  - nvidia-22
  - nvidia-223
  - nvidia-224
...
----
+
In this example, {VirtProductName} uses the `nvidia-223` type.
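The two scenarios above can be modeled with a short Python sketch. This is a simplified model that reproduces the documented examples, not the actual {VirtProductName} implementation: the function names are hypothetical, and the round-robin rule (cycle through the requested types that each card supports, in list order) is an assumption inferred from the example. The number of vGPUs created per card depends on the hardware and is not modeled.

```python
def first_supported_type(requested_types, supported_types):
    """Return the first requested type that the card supports, or None.

    Models the single-card case: list order decides which type is used.
    """
    for mdev_type in requested_types:
        if mdev_type in supported_types:
            return mdev_type
    return None


def assign_round_robin(requested_types, cards):
    """Assign one mdev type per card in a round-robin manner.

    ``cards`` is a list with one set of supported types per physical card.
    Assumption: each card takes the next requested type that it supports,
    cycling through the supported candidates in list order.
    """
    assignments = []
    next_index = 0
    for supported in cards:
        candidates = [t for t in requested_types if t in supported]
        if not candidates:
            assignments.append(None)  # no requested type fits this card
            continue
        assignments.append(candidates[next_index % len(candidates)])
        next_index += 1
    return assignments


# Multi-card example from this section: two cards with identical support.
requested = ["nvidia-222", "nvidia-228", "nvidia-105", "nvidia-108"]
card = {"nvidia-105", "nvidia-108", "nvidia-217", "nvidia-299"}
print(assign_round_robin(requested, [card, card]))

# Single-card example from this section.
print(first_supported_type(
    ["nvidia-22", "nvidia-223", "nvidia-224"],
    {"nvidia-223", "nvidia-224"},
))
```

With the inputs from this section, the first call assigns `nvidia-105` to the first card and `nvidia-108` to the second card, and the second call selects `nvidia-223`.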

@@ -0,0 +1,10 @@
// Module included in the following assemblies:
//
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
:_content-type: CONCEPT
[id="virt-preparing-host-for-mdevs_{context}"]
= Preparing hosts for mediated devices
You must enable the Input-Output Memory Management Unit (IOMMU) driver before you can configure mediated devices.
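As a quick sanity check, the following Python sketch tests whether a Linux host reports any IOMMU groups. It relies on a common heuristic rather than a documented {product-title} procedure: on Linux, the `/sys/kernel/iommu_groups` directory is typically populated only when the IOMMU is enabled. The function name `iommu_groups_present` and the `sysfs_root` parameter are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def iommu_groups_present(sysfs_root="/sys"):
    """Return True if at least one IOMMU group is listed under sysfs.

    Heuristic only: an empty or missing directory suggests that the IOMMU
    is not enabled, but it does not identify the cause.
    """
    groups_dir = Path(sysfs_root) / "kernel" / "iommu_groups"
    return groups_dir.is_dir() and any(groups_dir.iterdir())


if __name__ == "__main__":
    print("IOMMU groups found:", iommu_groups_present())
```

The `sysfs_root` parameter makes the check easy to exercise against a temporary directory instead of the live `/sys` tree.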

@@ -0,0 +1,10 @@
// Module included in the following assemblies:
//
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
:_content-type: CONCEPT
[id="prerequisites_{context}"]
== Prerequisites
* If your hardware vendor provides drivers, you installed them on the nodes where you want to create mediated devices.
** If you use NVIDIA cards, you link:https://access.redhat.com/solutions/6738411[installed the NVIDIA GRID driver].

@@ -0,0 +1,64 @@
// Module included in the following assemblies:
//
// * virt/virtual_machines/advanced_vm_management/virt-configuring-mediated-devices.adoc
:_content-type: REFERENCE
[id="configuration-overview_{context}"]
= Configuration overview
When configuring mediated devices, an administrator must complete the following tasks:
* Create the mediated devices.
* Expose the mediated devices to the cluster.
The `HyperConverged` CR includes APIs that accomplish both tasks.
.Creating mediated devices
[source,yaml]
----
...
spec:
  mediatedDevicesConfiguration:
    mediatedDevicesTypes: <1>
    - <device_type>
    nodeMediatedDeviceTypes: <2>
    - mediatedDevicesTypes: <3>
      - <device_type>
      nodeSelector: <4>
        <node_selector_key>: <node_selector_value>
...
----
<1> Required: Configures global settings for the cluster.
<2> Optional: Overrides the global configuration for a specific node or group of nodes. Must be used with the global `mediatedDevicesTypes` configuration.
<3> Required if you use `nodeMediatedDeviceTypes`. Overrides the global `mediatedDevicesTypes` configuration for the specified nodes.
<4> Required if you use `nodeMediatedDeviceTypes`. Must include a `key:value` pair.
.Exposing mediated devices to the cluster
[source,yaml]
----
...
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: GRID T4-2Q <1>
      resourceName: nvidia.com/GRID_T4-2Q <2>
...
----
<1> Exposes the mediated devices that map to this value on the host.
+
[NOTE]
====
You can see the mediated device types that your device supports by viewing the contents of `/sys/bus/pci/devices/<domain>:<bus>:<slot>.<function>/mdev_supported_types/<type>/name`, substituting the correct values for your system.
For example, the name file for the `nvidia-231` type contains the selector string `GRID T4-2Q`. Using `GRID T4-2Q` as the `mdevNameSelector` value allows nodes to use the `nvidia-231` type.
====
<2> The `resourceName` value must match the resource name that is allocated on the node. Find the `resourceName` by using the following command:
+
[source,terminal]
----
$ oc get $NODE -o json \
  | jq '.status.allocatable
    | with_entries(select(.key | startswith("nvidia.com/")))
    | with_entries(select(.value != "0"))'
----
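The note for the `mdevNameSelector` callout describes reading the `name` file of each supported type. The following Python sketch shows one way to collect those names for a single PCI device directory and to find the types that a selector string matches. The function names are hypothetical, and the sketch assumes only the `mdev_supported_types/<type>/name` layout described in the note. Pass the device directory for your system as the argument.

```python
from pathlib import Path


def list_mdev_type_names(pci_device_dir):
    """Map each mdev type ID under a PCI device directory to the text of
    its ``name`` file, for example {"nvidia-231": "GRID T4-2Q"}."""
    types_dir = Path(pci_device_dir) / "mdev_supported_types"
    names = {}
    if not types_dir.is_dir():
        return names  # the device exposes no mediated device types
    for type_dir in sorted(types_dir.iterdir()):
        name_file = type_dir / "name"
        if name_file.is_file():
            names[type_dir.name] = name_file.read_text().strip()
    return names


def types_matching_selector(names, mdev_name_selector):
    """Return the type IDs whose name equals the selector string."""
    return [type_id for type_id, name in names.items()
            if name == mdev_name_selector]
```

For a device whose `nvidia-231` name file contains `GRID T4-2Q`, calling `types_matching_selector(names, "GRID T4-2Q")` returns a list that contains `nvidia-231`, which is the mapping that the note describes.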

@@ -13,31 +13,33 @@ ifdef::openshift-enterprise[]
include::snippets/technology-preview.adoc[]
endif::[]
[id="prerequisites_virt-configuring-mediated-devices"]
== Prerequisites
* If your hardware vendor provides drivers, you installed them on the nodes where you want to create mediated devices.
** If you use NVIDIA cards, you link:https://access.redhat.com/solutions/6738411[installed the NVIDIA GRID driver].
include::modules/about-using-gpu-operator.adoc[leveloffset=+1]
include::modules/virt-about-using-virtual-gpus.adoc[leveloffset=+1]
[id="virt-preparing-host-for-mdevs"]
== Preparing hosts for mediated devices
include::modules/virt-prerequisites-mediated-devices.adoc[leveloffset=+2]
You must enable the IOMMU (Input-Output Memory Management Unit) driver before you can configure mediated devices.
include::modules/virt-virtual-gpus-config-overview.adoc[leveloffset=+2]
include::modules/virt-adding-kernel-arguments-enable-iommu.adoc[leveloffset=+2]
include::modules/virt-how-virtual-gpus-assigned-nodes.adoc[leveloffset=+2]
[id="virt-adding-and-removing-mediated-devices"]
== Adding and removing mediated devices
include::modules/virt-about-changing-removing-mediated-devices.adoc[leveloffset=+2]
include::modules/virt-creating-and-exposing-mediated-devices.adoc[leveloffset=+2]
include::modules/virt-preparing-hosts-for-mediated-devices.adoc[leveloffset=+2]
include::modules/virt-removing-mediated-device-from-cluster-cli.adoc[leveloffset=+2]
include::modules/virt-adding-kernel-arguments-enable-iommu.adoc[leveloffset=+3]
include::modules/virt-add-remove-mediated-devices.adoc[leveloffset=+2]
include::modules/virt-creating-and-exposing-mediated-devices.adoc[leveloffset=+3]
include::modules/virt-removing-mediated-device-from-cluster-cli.adoc[leveloffset=+3]
// VM owner task:
include::modules/virt-assigning-mediated-device-virtual-machine.adoc[leveloffset=+1]
include::modules/using-mediated-devices.adoc[leveloffset=+1]
include::modules/virt-assigning-mediated-device-virtual-machine.adoc[leveloffset=+2]
[role="_additional-resources"]
[id="additional-resources_virt-configuring-mediated-devices"]