// Module included in the following assemblies:
//
// * machine_management/creating-machinesets/creating-machineset-azure.adoc
:_mod-docs-content-type: PROCEDURE
[id="nvidia-gpu-aws-adding-a-gpu-node_{context}"]
= Adding a GPU node to an existing {product-title} cluster
You can copy and modify a default compute machine set configuration to create a GPU-enabled machine set and machines for the Azure cloud provider.
The following table lists the validated instance types:
[cols="1,1,1,1"]
|===
|vmSize |NVIDIA GPU accelerator |Maximum number of GPUs |Architecture
|`Standard_NC24s_v3`
|V100
|4
|x86
|`Standard_NC4as_T4_v3`
|T4
|1
|x86
|`ND A100 v4`
|A100
|8
|x86
|===
[NOTE]
====
By default, Azure subscriptions do not have a quota for GPU instance types. You must request a quota increase for the Azure instance families listed in the preceding table.
====
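One way to check the current quota before requesting an increase is to query regional usage with the Azure CLI. The following is a minimal sketch, assuming that the `az` CLI is installed and logged in and that the region is `centralus`, as in the examples in this module. The helper name `filter_gpu_families` and its match pattern are illustrative assumptions, not part of any product; adjust the pattern to the family names that your subscription reports.

[source,bash]
----
# Hypothetical helper: keep the first (header) line and any row whose
# quota name mentions an NC or ND family. The pattern is an assumption
# about how Azure names the GPU instance families.
filter_gpu_families() {
  awk 'NR == 1 || /Standard N[CD]/'
}

# Usage (requires an Azure CLI login, so it is left commented out):
# az vm list-usage --location centralus --output table | filter_gpu_families
----
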
.Procedure
. View the machine sets that exist in the `openshift-machine-api` namespace
by running the following command. Each compute machine set is associated with a different availability zone within the Azure region.
The installation program automatically load balances compute machines across availability zones.
+
[source,terminal]
----
$ oc get machineset -n openshift-machine-api
----
+
.Example output
+
[source,terminal]
----
NAME DESIRED CURRENT READY AVAILABLE AGE
myclustername-worker-centralus1 1 1 1 1 6h9m
myclustername-worker-centralus2 1 1 1 1 6h9m
myclustername-worker-centralus3 1 1 1 1 6h9m
----
. Make a copy of one of the existing compute `MachineSet` definitions and output the result to a YAML file by running the following command.
This will be the basis for the GPU-enabled compute machine set definition.
+
[source,terminal]
----
$ oc get machineset -n openshift-machine-api myclustername-worker-centralus1 -o yaml > machineset-azure.yaml
----
. View the content of the machine set definition by running the following command:
+
[source,terminal]
----
$ cat machineset-azure.yaml
----
+
.Example `machineset-azure.yaml` file
+
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "0"
    machine.openshift.io/memoryMb: "16384"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2023-02-06T14:08:19Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: myclustername
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: myclustername-worker-centralus1
  namespace: openshift-machine-api
  resourceVersion: "23601"
  uid: acd56e0c-7612-473a-ae37-8704f34b80de
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: myclustername
      machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: myclustername
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          acceleratedNetworking: true
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          diagnostics: {}
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/myclustername-rg/providers/Microsoft.Compute/galleries/gallery_myclustername_n6n4r/images/myclustername-gen2/versions/latest
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: centralus
          managedIdentity: myclustername-identity
          metadata:
            creationTimestamp: null
          networkResourceGroup: myclustername-rg
          osDisk:
            diskSettings: {}
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: myclustername
          resourceGroup: myclustername-rg
          spotVMOptions: {}
          subnet: myclustername-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_D4s_v3
          vnet: myclustername-vnet
          zone: "1"
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
----
. Make a copy of the `machineset-azure.yaml` file by running the following command:
+
[source,terminal]
----
$ cp machineset-azure.yaml machineset-azure-gpu.yaml
----
. Update the following fields in `machineset-azure-gpu.yaml`:
+
* Change `.metadata.name` to a name containing `gpu`.
* Change `.spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"]` to match the new `.metadata.name`.
* Change `.spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"]` to match the new `.metadata.name`.
* Change `.spec.template.spec.providerSpec.value.vmSize` to `Standard_NC4as_T4_v3`.
+
.Example `machineset-azure-gpu.yaml` file
+
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  annotations:
    machine.openshift.io/GPU: "1"
    machine.openshift.io/memoryMb: "28672"
    machine.openshift.io/vCPU: "4"
  creationTimestamp: "2023-02-06T20:27:12Z"
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: myclustername
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
  name: myclustername-nc4ast4-gpu-worker-centralus1
  namespace: openshift-machine-api
  resourceVersion: "166285"
  uid: 4eedce7f-6a57-4abe-b529-031140f02ffa
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: myclustername
      machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: myclustername
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
    spec:
      lifecycleHooks: {}
      metadata: {}
      providerSpec:
        value:
          acceleratedNetworking: true
          apiVersion: machine.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          diagnostics: {}
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/myclustername-rg/providers/Microsoft.Compute/galleries/gallery_myclustername_n6n4r/images/myclustername-gen2/versions/latest
            sku: ""
            version: ""
          kind: AzureMachineProviderSpec
          location: centralus
          managedIdentity: myclustername-identity
          metadata:
            creationTimestamp: null
          networkResourceGroup: myclustername-rg
          osDisk:
            diskSettings: {}
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: myclustername
          resourceGroup: myclustername-rg
          spotVMOptions: {}
          subnet: myclustername-worker-subnet
          userDataSecret:
            name: worker-user-data
          vmSize: Standard_NC4as_T4_v3
          vnet: myclustername-vnet
          zone: "1"
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
----
. To verify your changes, perform a `diff` of the original compute definition and the new GPU-enabled node definition by running the following command:
+
[source,terminal]
----
$ diff machineset-azure.yaml machineset-azure-gpu.yaml
----
+
.Example output
[source,terminal]
----
14c14
<   name: myclustername-worker-centralus1
---
>   name: myclustername-nc4ast4-gpu-worker-centralus1
23c23
<       machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
---
>       machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
30c30
<         machine.openshift.io/cluster-api-machineset: myclustername-worker-centralus1
---
>         machine.openshift.io/cluster-api-machineset: myclustername-nc4ast4-gpu-worker-centralus1
67c67
<           vmSize: Standard_D4s_v3
---
>           vmSize: Standard_NC4as_T4_v3
----
. Create the GPU-enabled compute machine set from the definition file by running the following command:
+
[source,terminal]
----
$ oc create -f machineset-azure-gpu.yaml
----
+
.Example output
+
[source,terminal]
----
machineset.machine.openshift.io/myclustername-nc4ast4-gpu-worker-centralus1 created
----
. View the machine sets that exist in the `openshift-machine-api` namespace
by running the following command. The GPU-enabled compute machine set is listed
alongside the existing compute machine sets, each of which is associated with a
different availability zone within the Azure region.
+
[source,terminal]
----
$ oc get machineset -n openshift-machine-api
----
+
.Example output
+
[source,terminal]
----
NAME DESIRED CURRENT READY AVAILABLE AGE
myclustername-nc4ast4-gpu-worker-centralus1 1 1 1 1 122m
myclustername-worker-centralus1 1 1 1 1 8h
myclustername-worker-centralus2 1 1 1 1 8h
myclustername-worker-centralus3 1 1 1 1 8h
----
. View the machines that exist in the `openshift-machine-api` namespace by running the following command. Each compute machine set is configured with one compute machine, although you can scale a compute machine set to add nodes in a particular region and zone.
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api
----
+
.Example output
+
[source,terminal]
----
NAME PHASE TYPE REGION ZONE AGE
myclustername-master-0 Running Standard_D8s_v3 centralus 2 6h40m
myclustername-master-1 Running Standard_D8s_v3 centralus 1 6h40m
myclustername-master-2 Running Standard_D8s_v3 centralus 3 6h40m
myclustername-nc4ast4-gpu-worker-centralus1-w9bqn Running Standard_NC4as_T4_v3 centralus 1 21m
myclustername-worker-centralus1-rbh6b Running Standard_D4s_v3 centralus 1 6h38m
myclustername-worker-centralus2-dbz7w Running Standard_D4s_v3 centralus 2 6h38m
myclustername-worker-centralus3-p9b8c Running Standard_D4s_v3 centralus 3 6h38m
----
. View the existing nodes by running the following command. Each node is an instance of a machine definition with a specific Azure region and {product-title} role.
+
[source,terminal]
----
$ oc get nodes
----
+
.Example output
+
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
myclustername-master-0 Ready control-plane,master 6h39m v1.34.2
myclustername-master-1 Ready control-plane,master 6h41m v1.34.2
myclustername-master-2 Ready control-plane,master 6h39m v1.34.2
myclustername-nc4ast4-gpu-worker-centralus1-w9bqn Ready worker 14m v1.34.2
myclustername-worker-centralus1-rbh6b Ready worker 6h29m v1.34.2
myclustername-worker-centralus2-dbz7w Ready worker 6h29m v1.34.2
myclustername-worker-centralus3-p9b8c Ready worker 6h31m v1.34.2
----
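The field updates made to `machineset-azure-gpu.yaml` earlier in this procedure can also be scripted instead of edited by hand. The following is a minimal sketch that uses `sed`. The function name `make_gpu_machineset` is hypothetical, and the sketch assumes that the machine set names contain only letters, digits, and hyphens, so that they are safe to use in a `sed` pattern. It applies only the four documented changes and leaves every other field as it is.

[source,bash]
----
# Hypothetical helper. Reads a MachineSet definition on stdin and writes
# the modified definition to stdout.
#   $1: name of the source machine set
#   $2: name of the new GPU-enabled machine set
#   $3: GPU vmSize to use
make_gpu_machineset() {
  # The first expression renames the machine set wherever the old name
  # ends a line (.metadata.name and the two cluster-api-machineset labels).
  # The second expression replaces the value of the vmSize field.
  sed -e "s/$1\$/$2/" \
      -e "s/^\([[:space:]]*vmSize:\) .*/\1 $3/"
}

# Usage, with the example names from this module:
# make_gpu_machineset myclustername-worker-centralus1 \
#   myclustername-nc4ast4-gpu-worker-centralus1 Standard_NC4as_T4_v3 \
#   < machineset-azure.yaml > machineset-azure-gpu.yaml
----
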
.Verification
. View the machine set you created by running the following command:
+
[source,terminal]
----
$ oc get machineset -n openshift-machine-api | grep gpu
----
+
The `MachineSet` replica count is set to `1`, so a new `Machine` object is created automatically.
+
.Example output
+
[source,terminal]
----
myclustername-nc4ast4-gpu-worker-centralus1 1 1 1 1 121m
----
. View the `Machine` object that the machine set created by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machines | grep gpu
----
+
.Example output
+
[source,terminal]
----
myclustername-nc4ast4-gpu-worker-centralus1-w9bqn Running Standard_NC4as_T4_v3 centralus 1 21m
----
[NOTE]
====
You do not need to specify a namespace when you view nodes, because the node definition is cluster scoped.
====
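The verification checks above can be combined into a single pass or fail test. The following is a minimal sketch, assuming that the GPU machine names contain the string `gpu`, as in the examples in this module, and that `PHASE` is the second column of the `oc get machines` output. The function name `gpu_machines_running` is hypothetical.

[source,bash]
----
# Hypothetical helper. Reads `oc get machines` rows on stdin and succeeds
# only if at least one row names a GPU machine and every such row reports
# the Running phase (column 2).
gpu_machines_running() {
  awk '/gpu/ { seen = 1; if ($2 != "Running") bad = 1 }
       END   { exit ((seen && !bad) ? 0 : 1) }'
}

# Usage (requires access to the cluster, so it is left commented out):
# oc get machines -n openshift-machine-api --no-headers | gpu_machines_running \
#   && echo "GPU machine is running"
----

The helper returns a nonzero exit status both when a GPU machine is still provisioning and when no GPU machine is listed, so either case is reported as not ready.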