// Module included in the following assemblies:
//
// * operators/user/das-dynamic-accelerator-slicer-operator.adoc
:_mod-docs-content-type: PROCEDURE
[id="das-operator-installing-cli_{context}"]
= Installing the Dynamic Accelerator Slicer Operator using the CLI

As a cluster administrator, you can install the Dynamic Accelerator Slicer (DAS) Operator using the OpenShift CLI.

.Prerequisites
* You have access to an {product-title} cluster using an account with `cluster-admin` permissions.
* You have installed the OpenShift CLI (`oc`).
* You have installed the required prerequisites:
** cert-manager Operator for Red Hat OpenShift
** Node Feature Discovery (NFD) Operator
** NVIDIA GPU Operator
** NodeFeatureDiscovery CR

.Procedure
. Configure the NVIDIA GPU Operator for MIG support:
.. Create a file named `gpu-cluster-policy.yaml` with the following content. This cluster policy disables the default NVIDIA device plugin and enables MIG support:
+
[source,yaml]
----
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  daemonsets:
    rollingUpdate:
      maxUnavailable: "1"
    updateStrategy: RollingUpdate
  dcgm:
    enabled: true
  dcgmExporter:
    config:
      name: ""
    enabled: true
    serviceMonitor:
      enabled: true
  devicePlugin:
    config:
      default: ""
      name: ""
    enabled: false
    mps:
      root: /run/nvidia/mps
  driver:
    certConfig:
      name: ""
    enabled: true
    kernelModuleConfig:
      name: ""
    licensingConfig:
      configMapName: ""
      nlsEnabled: true
    repoConfig:
      configMapName: ""
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    useNvidiaDriverCRD: false
    useOpenKernelModules: false
    virtualTopology:
      config: ""
  gdrcopy:
    enabled: false
  gds:
    enabled: false
  gfd:
    enabled: true
  mig:
    strategy: mixed
  migManager:
    config:
      default: ""
      name: default-mig-parted-config
    enabled: true
    env:
      - name: WITH_REBOOT
        value: 'true'
      - name: MIG_PARTED_MODE_CHANGE_ONLY
        value: 'true'
  nodeStatusExporter:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  sandboxDevicePlugin:
    enabled: true
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: "false"
    cuda:
      env:
        - name: WITH_WORKLOAD
          value: "false"
  vfioManager:
    enabled: true
  vgpuDeviceManager:
    enabled: true
  vgpuManager:
    enabled: false
----
.. Apply the cluster policy by running the following command:
+
[source,terminal]
----
$ oc apply -f gpu-cluster-policy.yaml
----
.. Verify that the NVIDIA GPU Operator cluster policy reaches the `Ready` state by running the following command:
+
[source,terminal]
----
$ oc get clusterpolicies.nvidia.com gpu-cluster-policy -w
----
+
Wait until the `STATUS` column shows `ready`.
+
.Example output
+
[source,terminal]
----
NAME                 STATUS   AGE
gpu-cluster-policy   ready    2025-08-14T08:56:45Z
----
.. Verify that all pods in the NVIDIA GPU Operator namespace are running by entering the following command:
+
[source,terminal]
----
$ oc get pods -n nvidia-gpu-operator
----
+
All pods should show a `Running` or `Completed` status.
.. Label nodes with MIG-capable GPUs to enable MIG mode by running the following command:
+
[source,terminal]
----
$ oc label node $NODE_NAME nvidia.com/mig.config=all-enabled --overwrite
----
+
Replace `$NODE_NAME` with the name of each node that has MIG-capable GPUs.
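+
If you are not sure which nodes have MIG-capable GPUs, you can list candidate nodes by using the `nvidia.com/mig.capable` label that GPU Feature Discovery applies to such nodes, for example:
+
[source,terminal]
----
$ oc get nodes -l nvidia.com/mig.capable=true
----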
+
[IMPORTANT]
====
After applying the MIG label, the labeled nodes reboot to enable MIG mode. Wait for the nodes to come back online before proceeding.
====
.. Verify that the nodes have successfully enabled MIG mode by running the following command:
+
[source,terminal]
----
$ oc get nodes -l nvidia.com/mig.config=all-enabled
----
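+
Optionally, check the `nvidia.com/mig.config.state` label that the MIG manager sets on each node; a value of `success` indicates that the MIG configuration has been applied, for example:
+
[source,terminal]
----
$ oc describe node $NODE_NAME | grep nvidia.com/mig.config.state
----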
. Create a namespace for the DAS Operator:
.. Create the following `Namespace` custom resource (CR) that defines the `das-operator` namespace, and save the YAML in the `das-namespace.yaml` file:
+
[source,yaml]
----
apiVersion: v1
kind: Namespace
metadata:
  name: das-operator
  labels:
    name: das-operator
    openshift.io/cluster-monitoring: "true"
----
.. Create the namespace by running the following command:
+
[source,terminal]
----
$ oc create -f das-namespace.yaml
----
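+
Optionally, confirm that the namespace exists before you continue:
+
[source,terminal]
----
$ oc get namespace das-operator
----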
. Install the DAS Operator in the namespace you created in the previous step by creating the following objects:
.. Create the following `OperatorGroup` CR and save the YAML in the `das-operatorgroup.yaml` file:
+
[source,yaml]
----
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: das-operator-
  name: das-operator
  namespace: das-operator
----
.. Create the `OperatorGroup` CR by running the following command:
+
[source,terminal]
----
$ oc create -f das-operatorgroup.yaml
----
.. Create the following `Subscription` CR and save the YAML in the `das-sub.yaml` file:
+
.Example Subscription
[source,yaml]
----
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: das-operator
  namespace: das-operator
spec:
  channel: "stable"
  installPlanApproval: Automatic
  name: das-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
----
.. Create the subscription object by running the following command:
+
[source,terminal]
----
$ oc create -f das-sub.yaml
----
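+
Optionally, verify that Operator Lifecycle Manager processes the subscription and installs the Operator by checking the ClusterServiceVersion (CSV). The exact CSV name and version depend on the release that the `stable` channel resolves to:
+
[source,terminal]
----
$ oc get csv -n das-operator
----
+
Wait until the `PHASE` column for the DAS Operator CSV shows `Succeeded`.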
.. Change to the `das-operator` project:
+
[source,terminal]
----
$ oc project das-operator
----
.. Create the following `DASOperator` CR and save the YAML in the `das-dasoperator.yaml` file:
+
.Example `DASOperator` CR
[source,yaml]
----
apiVersion: inference.redhat.com/v1alpha1
kind: DASOperator
metadata:
  name: cluster <1>
  namespace: das-operator
spec:
  managementState: Managed
  logLevel: Normal
  operatorLogLevel: Normal
----
<1> The name of the `DASOperator` CR must be `cluster`.
.. Create the `DASOperator` CR by running the following command:
+
[source,terminal]
----
$ oc create -f das-dasoperator.yaml
----

.Verification
* Verify that the Operator deployment is successful by running the following command:
+
[source,terminal]
----
$ oc get pods
----
+
.Example output
[source,terminal]
----
NAME                                    READY   STATUS    RESTARTS   AGE
das-daemonset-6rsfd                     1/1     Running   0          5m16s
das-daemonset-8qzgf                     1/1     Running   0          5m16s
das-operator-5946478b47-cjfcp           1/1     Running   0          5m18s
das-operator-5946478b47-npwmn           1/1     Running   0          5m18s
das-operator-webhook-59949d4f85-5n9qt   1/1     Running   0          68s
das-operator-webhook-59949d4f85-nbtdl   1/1     Running   0          68s
das-scheduler-6cc59dbf96-4r85f          1/1     Running   0          68s
das-scheduler-6cc59dbf96-bf6ml          1/1     Running   0          68s
----
+
A successful deployment shows all pods with a `Running` status. The deployment includes:
+
das-operator:: Main Operator controller pods
das-operator-webhook:: Webhook server pods for mutating pod requests
das-scheduler:: Scheduler plugin pods for MIG slice allocation
das-daemonset:: Daemonset pods that run only on nodes with MIG-compatible GPUs
+
[NOTE]
====
The `das-daemonset` pods only appear on nodes that have MIG-compatible GPU hardware. If you do not see any daemonset pods, verify that your cluster has nodes with supported GPU hardware and that the NVIDIA GPU Operator is properly configured.
====
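* Optionally, check the status of the `DASOperator` CR to confirm that the Operator reports itself as available. The resource name `dasoperator` is assumed from the `DASOperator` kind, and the status conditions follow standard Operator conventions, so the exact condition names can vary by version:
+
[source,terminal]
----
$ oc get dasoperator cluster -n das-operator -o yaml
----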