// Module included in the following assemblies:
//
// * architecture/nvidia-gpu-architecture-overview.adoc

:_content-type: CONCEPT

[id="nvidia-gpu-features_{context}"]
= NVIDIA GPU features for {product-title}

// NVIDIA GPU Operator::
// The NVIDIA GPU Operator is a Kubernetes Operator that enables {product-title} {VirtProductName} to expose GPUs to virtualized workloads running on {product-title}.
// It allows users to easily provision and manage GPU-enabled virtual machines, providing them with the ability to run complex artificial intelligence/machine learning (AI/ML) workloads on the same platform as their other workloads.
// It also provides an easy way to scale the GPU capacity of their infrastructure, allowing for rapid growth of GPU-based workloads.

NVIDIA Container Toolkit::
NVIDIA Container Toolkit enables you to create and run GPU-accelerated containers. The toolkit includes a container runtime library and utilities to automatically configure containers to use NVIDIA GPUs.
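+
As a brief illustration, once the toolkit and the GPU Operator are in place, a container can consume a GPU by requesting the `nvidia.com/gpu` extended resource. The following pod is a minimal sketch; the sample image name is illustrative:
+
[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8 # example CUDA sample image
    resources:
      limits:
        nvidia.com/gpu: 1 # request one GPU; the toolkit configures the container to access it
----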

NVIDIA AI Enterprise::
NVIDIA AI Enterprise is an end-to-end, cloud-native suite of AI and data analytics software optimized, certified, and supported with NVIDIA-Certified systems.
+
NVIDIA AI Enterprise includes support for Red Hat {product-title}. The following installation methods are supported:
+
* {product-title} on bare metal or VMware vSphere with GPU Passthrough.
* {product-title} on VMware vSphere with NVIDIA vGPU.

Multi-Instance GPU (MIG) Support in {product-title}::
MIG is useful whenever you have an application that does not require the full power of an entire GPU. The MIG feature of the NVIDIA Ampere architecture enables you to split your hardware resources into multiple GPU instances, each of which is available to the operating system as an independent CUDA-enabled GPU. The NVIDIA GPU Operator version 1.7.0 and higher provides MIG support for the A100 and A30 Ampere cards. These GPU instances are designed to support multiple independent CUDA applications (up to seven per GPU) so that they operate in complete isolation from each other, with dedicated hardware resources.
+
The GPU's compute units, in addition to its memory, can be split into multiple MIG instances. Each of these instances represents a standalone GPU device from a system perspective and can be connected to any application, container, or virtual machine running on the node.
+
From the perspective of the software that uses the GPU, each of these MIG instances looks like its own individual GPU.
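+
How a cluster exposes MIG instances depends on the GPU Operator configuration. As a sketch (assuming the Operator's usual `ClusterPolicy` name, `gpu-cluster-policy`), the MIG strategy is set in the `ClusterPolicy`, and the MIG geometry of an individual node can then be selected with a label such as `nvidia.com/mig.config=all-1g.5gb`:
+
[source,yaml]
----
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  mig:
    strategy: single # advertise each MIG instance as an nvidia.com/gpu resource
----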

Time-slicing NVIDIA GPUs in OpenShift::
GPU time-slicing enables workloads scheduled on overloaded GPUs to be interleaved.
+
This mechanism for enabling time-slicing of GPUs in Kubernetes enables a system administrator to define a set of replicas for a GPU, each of which can be independently distributed to a pod to run workloads on. Unlike multi-instance GPU (MIG), there is no memory or fault isolation between replicas, but for some workloads this is better than not sharing at all. Internally, GPU time-slicing is used to multiplex workloads from replicas of the same underlying GPU.
+
You can apply a cluster-wide default configuration for time-slicing. You can also apply node-specific configurations. For example, you can apply a time-slicing configuration only to nodes with Tesla T4 GPUs and not modify nodes with other GPU models.
+
You can combine these two approaches by applying a cluster-wide default configuration and then labeling nodes so that those nodes receive a node-specific configuration.
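+
A sketch of such a configuration, assuming the device plugin reads its sharing settings from a ConfigMap referenced by the `ClusterPolicy`; the ConfigMap name, entry name, and replica count are examples, and an entry can be selected for specific nodes with a label such as `nvidia.com/device-plugin.config=tesla-t4`:
+
[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  tesla-t4: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4 # each physical GPU is advertised as four nvidia.com/gpu replicas
----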

GPU Feature Discovery::
NVIDIA GPU Feature Discovery for Kubernetes is a software component that enables you to automatically generate labels for the GPUs available on a node. GPU Feature Discovery uses node feature discovery (NFD) to perform this labeling.
+
The Node Feature Discovery (NFD) Operator manages the discovery of hardware features and configurations in an {product-title} cluster by labeling nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.
+
You can find the NFD Operator in OperatorHub by searching for "Node Feature Discovery".
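+
As an illustration, the labels applied to a GPU node by NFD and GPU Feature Discovery look similar to the following excerpt; the values are examples, and `10de` is the NVIDIA PCI vendor ID:
+
[source,yaml]
----
apiVersion: v1
kind: Node
metadata:
  labels:
    feature.node.kubernetes.io/pci-10de.present: "true" # NFD: an NVIDIA PCI device is present
    nvidia.com/gpu.product: Tesla-T4 # GPU Feature Discovery: GPU model
    nvidia.com/gpu.count: "1" # GPU Feature Discovery: number of GPUs on the node
    nvidia.com/gpu.memory: "15360" # GPU Feature Discovery: GPU memory, in MiB
----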

NVIDIA GPU Operator with OpenShift Virtualization::
In the past, the GPU Operator provisioned worker nodes only to run GPU-accelerated containers. Now, the GPU Operator can also provision worker nodes for running GPU-accelerated virtual machines (VMs).
+
You can configure the GPU Operator to deploy different software components to worker nodes depending on which GPU workload is configured to run on those nodes.
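+
As a sketch of how this is configured in recent GPU Operator versions (field values here are examples), the `sandboxWorkloads` section of the `ClusterPolicy` enables VM-oriented software stacks, and individual nodes can then be labeled, for example with `nvidia.com/gpu.workload.config=vm-passthrough`, to select the workload type they serve:
+
[source,yaml]
----
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  sandboxWorkloads:
    enabled: true # allow provisioning nodes for GPU-accelerated VMs
    defaultWorkload: container # workload type for nodes without an explicit label
----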

GPU Operator dashboard::
You can install a console plugin to display GPU usage information on the cluster utilization screen in the {product-title} web console. GPU utilization information includes the number of available GPUs, power consumption (in watts) for each GPU, and the percentage of GPU workload used for video encoding and decoding.
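+
After the console plugin is installed, it must also be enabled in the cluster `Console` resource. A minimal sketch, assuming the plugin is deployed under the name `console-plugin-nvidia-gpu`:
+
[source,yaml]
----
apiVersion: operator.openshift.io/v1
kind: Console
metadata:
  name: cluster
spec:
  plugins:
  - console-plugin-nvidia-gpu # enable the NVIDIA GPU usage dashboard plugin
----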