openshift-docs/modules/nvidia-gpu-azure-adding-a-gpu-node.adoc

// Module included in the following assemblies:
//
//  * machine_management/creating-machinesets/creating-machineset-azure.adoc

:_mod-docs-content-type: PROCEDURE
[id="nvidia-gpu-aws-adding-a-gpu-node_{context}"]
= Adding a GPU node to an existing {product-title} cluster

You can copy and modify a default compute machine set configuration to create a GPU-enabled machine set and machines for the Azure cloud provider.

The following table lists the validated instance types:

[cols="1,1,1,1"]
|===
|vmSize |NVIDIA GPU accelerator |Maximum number of GPUs |Architecture

|`Standard_NC24s_v3`
|V100
|4
|x86

|`Standard_NC4as_T4_v3`
|T4
|1
|x86

|`ND A100 v4`
|A100
|8
|x86
|===

[NOTE]
====
By default, Azure subscriptions do not have a quota for the Azure instance types with GPU. You must request a quota increase for the Azure instance families listed in the preceding table.
====

.Procedure

. View the compute machine sets that exist in the `openshift-machine-api` namespace by running the following command. Each compute machine set is associated with a different availability zone within the Azure region.
The installer automatically load balances compute machines across availability zones.
+
[source,terminal]
----
$ oc get machineset -n openshift-machine-api
----
+
.Example output
+
[source,terminal]
----
NAME                              DESIRED   CURRENT   READY   AVAILABLE   AGE
myclustername-worker-centralus1   1         1         1       1           6h9m
myclustername-worker-centralus2   1         1         1       1           6h9m
myclustername-worker-centralus3   1         1         1       1           6h9m
----

. Make a copy of one of the existing compute `MachineSet` definitions and output the result to a YAML file by running the following command.
This file is the basis for the GPU-enabled compute machine set definition.
+
[source,terminal]
----
$ oc get machineset -n openshift-machine-api myclustername-worker-centralus1 -o yaml > machineset-azure.yaml
----

. View the content of the machine set definition by running the following command:
+
[source,terminal]
----
$ cat machineset-azure.yaml
----
+
.Example `machineset-azure.yaml` file
+
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2023-02-06T14:08:19Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: myclustername
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: myclustername-worker-centralus1
  namespace: openshift-machine-api
  resourceVersion: "23601"
  uid: acd56e0c-7612-473a-ae37-8704f34b80de
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: myclustername
      machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: myclustername
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          acceleratedNetworking: true
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          diagnostics: {}
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/myclustername-rg/providers/Microsoft.Compute/galleries/gallery_myclustername_n6n4r/images/myclustername-gen2/versions/latest
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: centralus
          managedIdentity: myclustername-identity
          metadata:
            creationTimestamp: null
          networkResourceGroup: myclustername-rg
          osDisk:
            diskSettings: {}
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: myclustername
          resourceGroup: myclustername-rg
          spotVMOptions: {}
          subnet: myclustername-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_D4s_v3
          vnet: myclustername-vnet
          zone: "1"
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
----

. Make a copy of the `machineset-azure.yaml` file by running the following command:
+
[source,terminal]
----
$ cp machineset-azure.yaml machineset-azure-gpu.yaml
----

. Update the following fields in `machineset-azure-gpu.yaml`:
+
* Change `.metadata.name` to a name containing `gpu`.

* Change `.spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"]` to match the new `.metadata.name`.

* Change `.spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"]` to match the new `.metadata.name`.

* Change `.spec.template.spec.providerSpec.value.vmSize` to `Standard_NC4as_T4_v3`.
+
.Example `machineset-azure-gpu.yaml` file
+
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "1"
    machine.openshift.io/memoryMb: "28672"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2023-02-06T20:27:12Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: myclustername
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: myclustername-nc4ast4-gpu-worker-centralus1
  namespace: openshift-machine-api
  resourceVersion: "166285"
  uid: 4eedce7f-6a57-4abe-b529-031140f02ffa
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: myclustername
      machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: myclustername
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          acceleratedNetworking: true
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          diagnostics: {}
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/myclustername-rg/providers/Microsoft.Compute/galleries/gallery_myclustername_n6n4r/images/myclustername-gen2/versions/latest
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: centralus
          managedIdentity: myclustername-identity
          metadata:
            creationTimestamp: null
          networkResourceGroup: myclustername-rg
          osDisk:
            diskSettings: {}
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: myclustername
          resourceGroup: myclustername-rg
          spotVMOptions: {}
          subnet: myclustername-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_NC4as_T4_v3
          vnet: myclustername-vnet
          zone: "1"
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
----
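+
The field changes in this step can also be scripted instead of edited by hand. The following is a minimal sketch, not part of the validated procedure: `make_gpu_machineset` is a hypothetical helper name, and the sketch assumes that the machine set names contain only letters, digits, and hyphens and that `vmSize` appears once in the file. It rewrites `.metadata.name`, both `machine.openshift.io/cluster-api-machineset` labels, and `vmSize`, and leaves every other line as copied.
+
[source,bash]
----
# Hypothetical helper (not an oc subcommand): derive a GPU machine set
# definition from an existing one.
#   $1 source file   $2 destination file
#   $3 old machine set name   $4 new machine set name   $5 new vmSize
# Assumes the names are safe to embed in a sed expression
# (letters, digits, and hyphens only).
make_gpu_machineset() {
  local src=$1 dst=$2 old_name=$3 new_name=$4 vm_size=$5

  # First expression: replace the old name wherever it is the whole value
  # of a key (metadata.name and the two cluster-api-machineset labels).
  # Second expression: replace the value of the vmSize key, keeping indentation.
  sed -e "s/: ${old_name}\$/: ${new_name}/" \
      -e "s/^\( *vmSize: \).*/\1${vm_size}/" \
      "$src" > "$dst"
}
----
+
Under these assumptions, calling `make_gpu_machineset machineset-azure.yaml machineset-azure-gpu.yaml myclustername-worker-centralus1 myclustername-nc4ast4-gpu-worker-centralus1 Standard_NC4as_T4_v3` would produce the GPU definition directly from the original file. Review the result before you create the machine set.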

. To verify your changes, perform a `diff` of the original compute definition and the new GPU-enabled node definition by running the following command:
+
[source,terminal]
----
$ diff machineset-azure.yaml machineset-azure-gpu.yaml
----
+
.Example output
+
[source,terminal]
----
14c14
<   name: myclustername-worker-centralus1
---
>   name: myclustername-nc4ast4-gpu-worker-centralus1
23c23
<       machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
---
>       machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
30c30
<         machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
---
>         machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
67c67
<           vmSize: Standard_D4s_v3
---
>           vmSize: Standard_NC4as_T4_v3
----
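+
To catch accidental edits, the `diff` output can be filtered so that only unexpected changes remain. The following is a rough sketch under stated assumptions: `unexpected_changes` is a hypothetical helper name, and the filter treats any changed `name`, `machine.openshift.io/cluster-api-machineset`, or `vmSize` line as expected, so it does not distinguish `.metadata.name` from other `name` keys.
+
[source,bash]
----
# Hypothetical helper: print changed lines between two machine set files
# that touch anything other than a name, a cluster-api-machineset label,
# or vmSize. No output means only the expected kinds of lines changed.
#   $1 original file   $2 modified file
unexpected_changes() {
  diff "$1" "$2" \
    | grep '^[<>]' \
    | grep -Ev '^[<>] +(name|machine\.openshift\.io/cluster-api-machineset|vmSize): ' \
    || true   # grep exits non-zero when nothing is left; that is the good case
}
----
+
For example, `unexpected_changes machineset-azure.yaml machineset-azure-gpu.yaml` prints nothing when the two files differ only in the lines shown in the example output.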

. Create the GPU-enabled compute machine set from the definition file by running the following command:
+
[source,terminal]
----
$ oc create -f machineset-azure-gpu.yaml
----
+
.Example output
+
[source,terminal]
----
machineset.machine.openshift.io/myclustername-nc4ast4-gpu-worker-centralus1 created
----

. View the compute machine sets in the `openshift-machine-api` namespace again by running the following command. The list now includes the GPU-enabled compute machine set.
+
[source,terminal]
----
$ oc get machineset -n openshift-machine-api
----
+
.Example output
+
[source,terminal]
----
NAME                                          DESIRED   CURRENT   READY   AVAILABLE   AGE
myclustername-nc4ast4-gpu-worker-centralus1   1         1         1       1           122m
myclustername-worker-centralus1               1         1         1       1           8h
myclustername-worker-centralus2               1         1         1       1           8h
myclustername-worker-centralus3               1         1         1       1           8h
----
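+
A new machine set can take several minutes to report its machine as ready. As a rough check, the table output can be filtered for machine sets whose `READY` count does not yet match `DESIRED`. This is a sketch only: `machinesets_not_ready` is a hypothetical helper name, and it assumes the column order shown in the example output (`NAME DESIRED CURRENT READY AVAILABLE AGE`).
+
[source,bash]
----
# Hypothetical helper: read `oc get machineset` table output on stdin and
# print the name of each machine set whose READY value ($4) differs from
# its DESIRED value ($2). The header row, if present, is skipped.
machinesets_not_ready() {
  awk '$1 != "NAME" && $2 != $4 { print $1 }'
}
----
+
For example, `oc get machineset -n openshift-machine-api | machinesets_not_ready` prints nothing when every machine set in the list is fully ready.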

. View the machines that exist in the `openshift-machine-api` namespace by running the following command. In this example, each compute machine set has one compute machine, although you can scale a compute machine set to add nodes in a particular region and zone.
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api
----
+
.Example output
+
[source,terminal]
----
NAME                                                PHASE     TYPE                   REGION      ZONE   AGE
myclustername-master-0                              Running   Standard_D8s_v3        centralus   2      6h40m
myclustername-master-1                              Running   Standard_D8s_v3        centralus   1      6h40m
myclustername-master-2                              Running   Standard_D8s_v3        centralus   3      6h40m
myclustername-nc4ast4-gpu-worker-centralus1-w9bqn   Running   Standard_NC4as_T4_v3   centralus   1      21m
myclustername-worker-centralus1-rbh6b               Running   Standard_D4s_v3        centralus   1      6h38m
myclustername-worker-centralus2-dbz7w               Running   Standard_D4s_v3        centralus   2      6h38m
myclustername-worker-centralus3-p9b8c               Running   Standard_D4s_v3        centralus   3      6h38m
----

. View the existing nodes by running the following command. Note that each node is an instance of a machine definition with a specific Azure region and {product-title} role.
+
[source,terminal]
----
$ oc get nodes
----
+
.Example output
+
[source,terminal]
----
NAME                                                STATUS   ROLES                  AGE     VERSION
myclustername-master-0                              Ready    control-plane,master   6h39m   v1.34.2
myclustername-master-1                              Ready    control-plane,master   6h41m   v1.34.2
myclustername-master-2                              Ready    control-plane,master   6h39m   v1.34.2
myclustername-nc4ast4-gpu-worker-centralus1-w9bqn   Ready    worker                 14m     v1.34.2
myclustername-worker-centralus1-rbh6b               Ready    worker                 6h29m   v1.34.2
myclustername-worker-centralus2-dbz7w               Ready    worker                 6h29m   v1.34.2
myclustername-worker-centralus3-p9b8c               Ready    worker                 6h31m   v1.34.2
----

.Verification

. View the machine set you created by running the following command:
+
[source,terminal]
----
$ oc get machineset -n openshift-machine-api | grep gpu
----
+
The `MachineSet` replica count is set to `1`, so a new `Machine` object is created automatically.
+
.Example output
+
[source,terminal]
----
myclustername-nc4ast4-gpu-worker-centralus1   1         1         1       1           121m
----

. View the `Machine` object that the machine set created by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machines | grep gpu
----
+
.Example output
+
[source,terminal]
----
myclustername-nc4ast4-gpu-worker-centralus1-w9bqn   Running   Standard_NC4as_T4_v3   centralus   1      21m
----

[NOTE]
====
You do not need to specify a namespace when you view nodes. The node definition is cluster scoped.
====