// Module included in the following assemblies:
//
// * operators/user/das-dynamic-accelerator-slicer-operator.adoc

:_mod-docs-content-type: PROCEDURE
[id="das-operator-installing-cli_{context}"]
= Installing the Dynamic Accelerator Slicer Operator using the CLI

As a cluster administrator, you can install the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift CLI.

.Prerequisites

* You have access to an {product-title} cluster using an account with `cluster-admin` permissions.
* You have installed the OpenShift CLI (`oc`).
* You have installed the following prerequisites:
** cert-manager Operator for Red Hat OpenShift
** Node Feature Discovery (NFD) Operator
** NVIDIA GPU Operator
** `NodeFeatureDiscovery` CR
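The prerequisite Operators install through Operator Lifecycle Manager, so their presence can be spot-checked by listing ClusterServiceVersions. The following is a minimal sketch: the helper name and the grep patterns are illustrative, and exact CSV names vary by catalog and version.

```shell
# Sketch only: list ClusterServiceVersions whose names suggest that the
# prerequisite Operators are installed. The helper name and the patterns
# are illustrative; actual CSV names vary by catalog and version.
check_prereq_csvs() {
  oc get csv -A --no-headers | grep -E 'cert-manager|nfd|gpu-operator'
}
```

An empty result for any of the patterns suggests that the corresponding Operator is not installed.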

.Procedure

. Configure the NVIDIA GPU Operator for Multi-Instance GPU (MIG) support:
.. Create a file named `gpu-cluster-policy.yaml` with the following cluster policy, which disables the default NVIDIA device plugin and enables MIG support:
+
[source,yaml]
----
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  daemonsets:
    rollingUpdate:
      maxUnavailable: "1"
    updateStrategy: RollingUpdate
  dcgm:
    enabled: true
  dcgmExporter:
    config:
      name: ""
    enabled: true
    serviceMonitor:
      enabled: true
  devicePlugin:
    config:
      default: ""
      name: ""
    enabled: false
    mps:
      root: /run/nvidia/mps
  driver:
    certConfig:
      name: ""
    enabled: true
    kernelModuleConfig:
      name: ""
    licensingConfig:
      configMapName: ""
      nlsEnabled: true
    repoConfig:
      configMapName: ""
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    useNvidiaDriverCRD: false
    useOpenKernelModules: false
    virtualTopology:
      config: ""
  gdrcopy:
    enabled: false
  gds:
    enabled: false
  gfd:
    enabled: true
  mig:
    strategy: mixed
  migManager:
    config:
      default: ""
      name: default-mig-parted-config
    enabled: true
    env:
    - name: WITH_REBOOT
      value: 'true'
    - name: MIG_PARTED_MODE_CHANGE_ONLY
      value: 'true'
  nodeStatusExporter:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  sandboxDevicePlugin:
    enabled: true
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
  validator:
    plugin:
      env:
      - name: WITH_WORKLOAD
        value: "false"
    cuda:
      env:
      - name: WITH_WORKLOAD
        value: "false"
  vfioManager:
    enabled: true
  vgpuDeviceManager:
    enabled: true
  vgpuManager:
    enabled: false
----

.. Apply the cluster policy by running the following command:
+
[source,terminal]
----
$ oc apply -f gpu-cluster-policy.yaml
----

.. Verify that the NVIDIA GPU Operator cluster policy reaches the `Ready` state by running the following command:
+
[source,terminal]
----
$ oc get clusterpolicies.nvidia.com gpu-cluster-policy -w
----
+
Wait until the `STATUS` column shows `ready`.
+
.Example output
[source,terminal]
----
NAME                 STATUS   AGE
gpu-cluster-policy   ready    2025-08-14T08:56:45Z
----

.. Verify that all pods in the NVIDIA GPU Operator namespace are running by running the following command:
+
[source,terminal]
----
$ oc get pods -n nvidia-gpu-operator
----
+
All pods should show a `Running` or `Completed` status.
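If you script this check, filtering for unhealthy pods is often more useful than scanning the full listing. The following is a minimal sketch: the helper name is illustrative, and it assumes the default five-column `oc get pods` layout (`NAME READY STATUS RESTARTS AGE`).

```shell
# Sketch only: print any pod whose STATUS is neither Running nor Completed.
# Assumes the default five-column `oc get pods` layout; an empty result
# means the namespace is healthy. The helper name is illustrative.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" {print $1, $3}'
}
# Intended usage (illustrative):
#   oc get pods -n nvidia-gpu-operator --no-headers | unhealthy_pods
```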

.. Label nodes with MIG-capable GPUs to enable MIG mode by running the following command:
+
[source,terminal]
----
$ oc label node $NODE_NAME nvidia.com/mig.config=all-enabled --overwrite
----
+
Replace `$NODE_NAME` with the name of each node that has MIG-capable GPUs.
+
[IMPORTANT]
====
After applying the MIG label, the labeled nodes reboot to enable MIG mode. Wait for the nodes to come back online before proceeding.
====
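On clusters with many GPU nodes, the labeling can be looped instead of repeated by hand. The following is a minimal sketch: the helper name is illustrative, and it assumes that GPU Feature Discovery has already labeled MIG-capable nodes with `nvidia.com/mig.capable=true`.

```shell
# Sketch only: label every MIG-capable node in one pass. Assumes GPU
# Feature Discovery has set nvidia.com/mig.capable=true on such nodes;
# the helper name is illustrative.
label_mig_nodes() {
  for node in $(oc get nodes -l nvidia.com/mig.capable=true -o name); do
    oc label "$node" nvidia.com/mig.config=all-enabled --overwrite
  done
}
```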

.. Verify that the nodes have successfully enabled MIG mode by running the following command:
+
[source,terminal]
----
$ oc get nodes -l nvidia.com/mig.config=all-enabled
----

. Create a namespace for the DAS Operator:
.. Create the following `Namespace` custom resource (CR) that defines the `das-operator` namespace, and save the YAML in the `das-namespace.yaml` file:
+
[source,yaml]
----
apiVersion: v1
kind: Namespace
metadata:
  name: das-operator
  labels:
    name: das-operator
    openshift.io/cluster-monitoring: "true"
----

.. Create the namespace by running the following command:
+
[source,terminal]
----
$ oc create -f das-namespace.yaml
----

. Install the DAS Operator in the namespace you created in the previous step by creating the following objects:
.. Create the following `OperatorGroup` CR and save the YAML in the `das-operatorgroup.yaml` file:
+
[source,yaml]
----
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: das-operator-
  name: das-operator
  namespace: das-operator
----

.. Create the `OperatorGroup` CR by running the following command:
+
[source,terminal]
----
$ oc create -f das-operatorgroup.yaml
----

.. Create the following `Subscription` CR and save the YAML in the `das-sub.yaml` file:
+
.Example Subscription
[source,yaml]
----
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: das-operator
  namespace: das-operator
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: das-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
----

.. Create the subscription object by running the following command:
+
[source,terminal]
----
$ oc create -f das-sub.yaml
----

.. Change to the `das-operator` project by running the following command:
+
[source,terminal]
----
$ oc project das-operator
----

.. Create the following `DASOperator` CR and save the YAML in the `das-dasoperator.yaml` file:
+
.Example `DASOperator` CR
[source,yaml]
----
apiVersion: inference.redhat.com/v1alpha1
kind: DASOperator
metadata:
  name: cluster <1>
  namespace: das-operator
spec:
  managementState: Managed
  logLevel: Normal
  operatorLogLevel: Normal
----
<1> The name of the `DASOperator` CR must be `cluster`.

.. Create the `DASOperator` CR by running the following command:
+
[source,terminal]
----
$ oc create -f das-dasoperator.yaml
----

.Verification

* Verify that the Operator deployment is successful by running the following command:
+
[source,terminal]
----
$ oc get pods
----
+
.Example output
[source,terminal]
----
NAME                                    READY   STATUS    RESTARTS   AGE
das-daemonset-6rsfd                     1/1     Running   0          5m16s
das-daemonset-8qzgf                     1/1     Running   0          5m16s
das-operator-5946478b47-cjfcp           1/1     Running   0          5m18s
das-operator-5946478b47-npwmn           1/1     Running   0          5m18s
das-operator-webhook-59949d4f85-5n9qt   1/1     Running   0          68s
das-operator-webhook-59949d4f85-nbtdl   1/1     Running   0          68s
das-scheduler-6cc59dbf96-4r85f          1/1     Running   0          68s
das-scheduler-6cc59dbf96-bf6ml          1/1     Running   0          68s
----
+
A successful deployment shows all pods with a `Running` status. The deployment includes the following components:
+
das-operator:: Main Operator controller pods
das-operator-webhook:: Webhook server pods for mutating pod requests
das-scheduler:: Scheduler plugin pods for MIG slice allocation
das-daemonset:: Daemonset pods that run only on nodes with MIG-compatible GPUs
+
[NOTE]
====
The `das-daemonset` pods only appear on nodes that have MIG-compatible GPU hardware. If you do not see any daemonset pods, verify that your cluster has nodes with supported GPU hardware and that the NVIDIA GPU Operator is properly configured.
====
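After the DAS Operator is running, workloads request MIG slices as extended resources on the pod spec. The following is a hypothetical sketch: the resource name `nvidia.com/mig-1g.5gb` and the container image are illustrative and depend on your GPU model and the MIG profiles you expose.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload-example   # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9   # illustrative image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # illustrative MIG slice resource
```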