Merge pull request #65806 from openshift-cherrypick-robot/cherry-pick-64465-to-enterprise-4.14

[enterprise-4.14] TELCODOCS-1571: Update NVIDIA GPU dashboard
2026-02-06 06:46:26 +01:00 · 2023-10-05 09:56:07 -05:00
parent f62b7c1dfe ed76123794
commit d04460cedf
8 changed files with 4 additions and 229 deletions
--- a/_topic_maps/_topic_map.yml
+++ b/_topic_maps/_topic_map.yml
@@ -2588,8 +2588,6 @@ Topics:
  File: managing-alerts
 - Name: Reviewing monitoring dashboards
  File: reviewing-monitoring-dashboards
- Name: The NVIDIA GPU administration dashboard
-  File: nvidia-gpu-admin-dashboard
 - Name: Monitoring bare-metal events
  File: using-rfhe
 - Name: Accessing third-party monitoring APIs
--- a/architecture/nvidia-gpu-architecture-overview.adoc
+++ b/architecture/nvidia-gpu-architecture-overview.adoc
@@ -57,10 +57,9 @@ include::modules/nvidia-gpu-features.adoc[leveloffset=+1]
 .Additional resources

 * link:https://docs.nvidia.com/ngc/ngc-deploy-on-premises/nvidia-certified-systems/index.html[NVIDIA-Certified Systems]
-* link:https://access.redhat.com/documentation/en-us/openshift_container_platform/4.13/html/monitoring/nvidia-gpu-admin-dashboard#doc-wrapper[The NVIDIA GPU administration dashboard]
 * link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/nvaie-with-ocp.html[NVIDIA AI Enterprise with OpenShift]
 * link:https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html#[NVIDIA Container Toolkit]
-* link:https://developer.nvidia.com/dcgm[NVIDIA DCGM]
+* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/enable-gpu-monitoring-dashboard.html[Enabling the GPU Monitoring Dashboard]
 * link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/mig-ocp.html[MIG Support in OpenShift Container Platform]
 * link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/time-slicing-gpus-in-openshift.html[Time-slicing NVIDIA GPUs in OpenShift]
 * link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/mirror-gpu-ocp-disconnected.html[Deploy GPU Operators in a disconnected or airgapped environment]
--- a/modules/nvidia-gpu-admin-dashboard-installing.adoc
+++ b/modules/nvidia-gpu-admin-dashboard-installing.adoc
@@ -1,136 +0,0 @@
-// Module included in the following assemblies:
-//
-// * monitoring/nvidia-gpu-admin-dashboard.adoc
-
-:_content-type: PROCEDURE
-[id="nvidia-gpu-admin-dashboard-installing_{context}"]
-= Installing the NVIDIA GPU administration dashboard
-
-Install the NVIDIA GPU plugin by using Helm on the OpenShift Container Platform (OCP) Console to add GPU capabilities.
-
-The OpenShift Console NVIDIA GPU plugin works as a remote bundle for the OCP console. To run the OpenShift Console NVIDIA GPU plugin
-an instance of the OCP console must be running.
-
-
-.Prerequisites
-
-* Red Hat OpenShift 4.11+
-* NVIDIA GPU operator
-* link:https://helm.sh/docs/intro/install/[Helm]
-
-
-.Procedure
-
-Use the following procedure to install the OpenShift Console NVIDIA GPU plugin.
-
-. Add the Helm repository:
-+
-[source,terminal]
----
-$ helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu
----
-+
-[source,terminal]
----
-$ helm repo update
----
-
-. Install the Helm chart in the default NVIDIA GPU operator namespace:
-+
-[source,terminal]
----
-$ helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu
----
-+
-.Example output
-+
-[source,terminal]
----
-NAME: console-plugin-nvidia-gpu
-LAST DEPLOYED: Tue Aug 23 15:37:35 2022
-NAMESPACE: nvidia-gpu-operator
-STATUS: deployed
-REVISION: 1
-NOTES:
-View the Console Plugin NVIDIA GPU deployed resources by running the following command:
-
-$ oc -n {{ .Release.Namespace }} get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu
-
-Enable the plugin by running the following command:
-
-# Check if a plugins field is specified
-$ oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
-
-# if not, then run the following command to enable the plugin
-$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' --type=merge
-
-# if yes, then run the following command to enable the plugin
-$ oc patch consoles.operator.openshift.io cluster --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' --type=json
-
-# add the required DCGM Exporter metrics ConfigMap to the existing NVIDIA operator ClusterPolicy CR:
-oc patch clusterpolicies.nvidia.com gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge
-
----
-+
-The dashboard relies mostly on Prometheus metrics exposed by the NVIDIA DCGM Exporter, but the default exposed metrics are not enough for the dashboard to render the required gauges. Therefore, the DCGM exporter is configured to expose a custom set of metrics, as shown here.
-+
-[source,yaml]
----
-apiVersion: v1
-data:
-  dcgm-metrics.csv: |
-    DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, gpu utilization.
-    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, mem utilization.
-    DCGM_FI_DEV_ENC_UTIL, gauge, enc utilization.
-    DCGM_FI_DEV_DEC_UTIL, gauge, dec utilization.
-    DCGM_FI_DEV_POWER_USAGE, gauge, power usage.
-    DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX, gauge, power mgmt limit.
-    DCGM_FI_DEV_GPU_TEMP, gauge, gpu temp.
-    DCGM_FI_DEV_SM_CLOCK, gauge, sm clock.
-    DCGM_FI_DEV_MAX_SM_CLOCK, gauge, max sm clock.
-    DCGM_FI_DEV_MEM_CLOCK, gauge, mem clock.
-    DCGM_FI_DEV_MAX_MEM_CLOCK, gauge, max mem clock.
-kind: ConfigMap
-metadata:
-  annotations:
-    meta.helm.sh/release-name: console-plugin-nvidia-gpu
-    meta.helm.sh/release-namespace: nvidia-gpu-operator
-  creationTimestamp: "2022-10-26T19:46:41Z"
-  labels:
-    app.kubernetes.io/component: console-plugin-nvidia-gpu
-    app.kubernetes.io/instance: console-plugin-nvidia-gpu
-    app.kubernetes.io/managed-by: Helm
-    app.kubernetes.io/name: console-plugin-nvidia-gpu
-    app.kubernetes.io/part-of: console-plugin-nvidia-gpu
-    app.kubernetes.io/version: latest
-    helm.sh/chart: console-plugin-nvidia-gpu-0.2.3
-  name: console-plugin-nvidia-gpu
-  namespace: nvidia-gpu-operator
-  resourceVersion: "19096623"
-  uid: 96cdf700-dd27-437b-897d-5cbb1c255068
----
-+
-Install the ConfigMap and edit the NVIDIA Operator ClusterPolicy CR to add that ConfigMap in the DCGM exporter configuration. The installation of the ConfigMap is done by the new version of the Console Plugin NVIDIA GPU Helm Chart, but the ClusterPolicy CR editing is done by the user.
-
-. View the deployed resources:
-+
-[source,terminal]
----
-$ oc -n nvidia-gpu-operator get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu
----
-+
-.Example output
-[source,terminal]
----
-NAME                                             READY   STATUS    RESTARTS   AGE
-pod/console-plugin-nvidia-gpu-7dc9cfb5df-ztksx   1/1     Running   0          2m6s
-
-NAME                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
-service/console-plugin-nvidia-gpu   ClusterIP   172.30.240.138   <none>        9443/TCP   2m6s
-
-NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
-deployment.apps/console-plugin-nvidia-gpu   1/1     1            1           2m6s
-
-NAME                                                   DESIRED   CURRENT   READY   AGE
-replicaset.apps/console-plugin-nvidia-gpu-7dc9cfb5df   1         1         1       2m6s
----
--- a/modules/nvidia-gpu-admin-dashboard-introduction.adoc
+++ b/modules/nvidia-gpu-admin-dashboard-introduction.adoc
@@ -1,14 +0,0 @@
-// Module included in the following assemblies:
-//
-// * monitoring/nvidia-gpu-admin-dashboard.adoc
-
-:_content-type: CONCEPT
-[id="nvidia-gpu-admin-dashboard-introduction_{context}"]
-= Introduction
-
-The OpenShift Console NVIDIA GPU plugin is a dedicated administration dashboard for NVIDIA GPU usage visualization
-in the OpenShift Container Platform (OCP) Console. The visualizations in the administration dashboard provide guidance on how to
-best optimize GPU resources in clusters, such as when a GPU is under- or over-utilized.
-
-The OpenShift Console NVIDIA GPU plugin works as a remote bundle for the OCP console.
-To run the plugin the OCP console must be running.
--- a/modules/nvidia-gpu-admin-dashboard-using.adoc
+++ b/modules/nvidia-gpu-admin-dashboard-using.adoc
@@ -1,59 +0,0 @@
-// Module included in the following assemblies:
-//
-// * monitoring/nvidia-gpu-admin-dashboard.adoc
-
-:_content-type: PROCEDURE
-[id="nvidia-gpu-admin-dashboard-using_{context}"]
-= Using the NVIDIA GPU administration dashboard
-
-After deploying the OpenShift Console NVIDIA GPU plugin, log in to the OpenShift Container Platform web console using your login credentials to access the *Administrator* perspective.
-
-To view the changes, you need to refresh the console to see the **GPUs** tab under **Compute**.
-
-
-== Viewing the cluster GPU overview
-
-You can view the status of your cluster GPUs in the Overview page by selecting
-Overview in the Home section.
-
-The Overview page provides information about the cluster GPUs, including:
-
-* Details about the GPU providers
-* Status of the GPUs
-* Cluster utilization of the GPUs
-
-== Viewing the GPUs dashboard
-
-You can view the NVIDIA GPU administration dashboard by selecting GPUs
-in the Compute section of the OpenShift Console.
-
-
-Charts on the GPUs dashboard include:
-
-* *GPU utilization*: Shows the ratio of time the graphics engine is active and is based on the ``DCGM_FI_PROF_GR_ENGINE_ACTIVE`` metric.
-
-* *Memory utilization*: Shows the memory being used by the GPU and is based on the ``DCGM_FI_DEV_MEM_COPY_UTIL`` metric.
-
-* *Encoder utilization*: Shows the video encoder rate of utilization and is based on the ``DCGM_FI_DEV_ENC_UTIL`` metric.
-
-* *Decoder utilization*: Shows the video decoder rate of utilization and is based on the ``DCGM_FI_DEV_DEC_UTIL`` metric.
-
-* *Power consumption*: Shows the average power usage of the GPU in Watts and is based on the ``DCGM_FI_DEV_POWER_USAGE`` metric.
-
-* *GPU temperature*: Shows the current GPU temperature and is based on the ``DCGM_FI_DEV_GPU_TEMP`` metric. The maximum is set to ``110``, which is an empirical number, as the actual number is not exposed via a metric.
-
-* *GPU clock speed*: Shows the average clock speed utilized by the GPU and is based on the ``DCGM_FI_DEV_SM_CLOCK`` metric.
-
-* *Memory clock speed*: Shows the average clock speed utilized by memory and is based on the ``DCGM_FI_DEV_MEM_CLOCK`` metric.
-
-== Viewing the GPU Metrics
-
-You can view the metrics for the GPUs by selecting the metric at the bottom of
-each GPU to view the Metrics page.
-
-On the Metrics page, you can:
-
-* Specify a refresh rate for the metrics
-* Add, run, disable, and delete queries
-* Insert Metrics
-* Reset the zoom view
--- a/modules/nvidia-gpu-csps.adoc
+++ b/modules/nvidia-gpu-csps.adoc
@@ -6,7 +6,7 @@
 [id="nvidia-gpu-csps_{context}"]
 = GPUs and CSPs

-You can deploy {product title} to one of the major cloud service providers (CSPs): Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure.
+You can deploy {product-title} to one of the major cloud service providers (CSPs): Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure.

 Two modes of operation are available: a fully managed deployment and a self-managed deployment.

--- a/modules/nvidia-gpu-features.adoc
+++ b/modules/nvidia-gpu-features.adoc
@@ -54,5 +54,5 @@ Up until this point, the GPU Operator only provisioned worker nodes to run GPU-a
 +
 You can configure the GPU Operator to deploy different software components to worker nodes depending on which GPU workload is configured to run on those nodes.

-GPU Operator dashboard::
-You can install a console plugin to display GPU usage information on the cluster utilization screen in the {product title} web console. GPU utilization information includes the number of available GPUs, power consumption (in watts) for each GPU and the percentage of GPU workload used for video encoding and decoding.
+GPU Monitoring dashboard::
+You can install a monitoring dashboard to display GPU usage information on the cluster *Observe* page in the {product-title} web console. GPU utilization information includes the number of available GPUs, power consumption (in watts), temperature (in degrees Celsius), utilization (in percent), and other metrics for each GPU.
--- a/monitoring/nvidia-gpu-admin-dashboard.adoc
+++ b/monitoring/nvidia-gpu-admin-dashboard.adoc
@@ -1,13 +0,0 @@
-:_content-type: ASSEMBLY
-[id="nvidia-gpu-admin-dashboard"]
-= The NVIDIA GPU administration dashboard
-include::_attributes/common-attributes.adoc[]
-:context: nvidia-gpu-admin-dashboard
-
-toc::[]
-
-include::modules/nvidia-gpu-admin-dashboard-introduction.adoc[leveloffset=+1]
-
-include::modules/nvidia-gpu-admin-dashboard-installing.adoc[leveloffset=+1]
-
-include::modules/nvidia-gpu-admin-dashboard-using.adoc[leveloffset=+1]