mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-05 21:46:22 +01:00
// Module included in the following assemblies:
//
// * machine_management/creating-machinesets/creating-machineset-aws.adoc
:_mod-docs-content-type: PROCEDURE
[id="nvidia-gpu-aws-adding-a-gpu-node_{context}"]
= Adding a GPU node to an existing {product-title} cluster
You can copy and modify a default compute machine set configuration to create a GPU-enabled machine set and machines for the AWS EC2 cloud provider.
For more information about the supported instance types, see the following NVIDIA documentation:
* link:https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/platform-support.html[NVIDIA GPU Operator Community support matrix]
* link:https://docs.nvidia.com/ai-enterprise/latest/product-support-matrix/index.html[NVIDIA AI Enterprise support matrix]
.Procedure
. View the existing nodes by running the following command. Note that each node is an instance of a machine definition with a specific AWS region and {product-title} role.
+
[source,terminal]
----
$ oc get nodes
----
+
.Example output
+
[source,terminal]
----
NAME                                        STATUS   ROLES                  AGE     VERSION
ip-10-0-52-50.us-east-2.compute.internal    Ready    worker                 3d17h   v1.33.4
ip-10-0-58-24.us-east-2.compute.internal    Ready    control-plane,master   3d17h   v1.33.4
ip-10-0-68-148.us-east-2.compute.internal   Ready    worker                 3d17h   v1.33.4
ip-10-0-68-68.us-east-2.compute.internal    Ready    control-plane,master   3d17h   v1.33.4
ip-10-0-72-170.us-east-2.compute.internal   Ready    control-plane,master   3d17h   v1.33.4
ip-10-0-74-50.us-east-2.compute.internal    Ready    worker                 3d17h   v1.33.4
----
. View the compute machine sets that exist in the `openshift-machine-api` namespace by running the following command. Each compute machine set is associated with a different availability zone within the AWS region. The installer automatically distributes compute machines across availability zones.
+
[source,terminal]
----
$ oc get machinesets -n openshift-machine-api
----
+
.Example output
+
[source,terminal]
----
NAME                                        DESIRED   CURRENT   READY   AVAILABLE   AGE
preserve-dsoc12r4-ktjfc-worker-us-east-2a   1         1         1       1           3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b   2         2         2       2           3d11h
----
. View the compute machines that exist in the `openshift-machine-api` namespace by running the following command. In this example, each compute machine set has one or two compute machines. You can scale a compute machine set to add a node in a particular region and zone.
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api | grep worker
----
+
.Example output
+
[source,terminal]
----
preserve-dsoc12r4-ktjfc-worker-us-east-2a-dts8r Running m5.xlarge us-east-2 us-east-2a 3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-dkv7w Running m5.xlarge us-east-2 us-east-2b 3d11h
preserve-dsoc12r4-ktjfc-worker-us-east-2b-k58cw Running m5.xlarge us-east-2 us-east-2b 3d11h
----
. Make a copy of one of the existing compute `MachineSet` definitions and output the result to a JSON file by running the following command. This will be the basis for the GPU-enabled compute machine set definition.
+
[source,terminal]
----
$ oc get machineset preserve-dsoc12r4-ktjfc-worker-us-east-2a -n openshift-machine-api -o json > <output_file.json>
----
. Edit the JSON file and make the following changes to the new `MachineSet` definition:
+
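* Add `gpu` to the machine set name, for example by changing `worker` to `worker-gpu`. This value becomes the name of the new machine set.
* Change the instance type of the new `MachineSet` definition to a `g4dn` instance type such as `g4dn.xlarge`, which includes an NVIDIA Tesla T4 GPU.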
To learn more about AWS `g4dn` instance types, see link:https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing[Accelerated Computing].
+
[source,terminal]
----
$ jq .spec.template.spec.providerSpec.value.instanceType preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json
"g4dn.xlarge"
----
+
In this example, the `<output_file.json>` file is saved as `preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json`.
. Update the following fields in `preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json`:
+
* `.metadata.name` to a name containing `gpu`.
* `.spec.selector.matchLabels["machine.openshift.io/cluster-api-machineset"]` to
match the new `.metadata.name`.
* `.spec.template.metadata.labels["machine.openshift.io/cluster-api-machineset"]`
to match the new `.metadata.name`.
* `.spec.template.spec.providerSpec.value.instanceType` to `g4dn.xlarge`.
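+
The field updates above can also be applied non-interactively with `jq` instead of a text editor. The following is a minimal sketch rather than a required step. It assumes that `jq` is installed. The file names `worker-sample.json` and `gpu-sample.json`, and the trimmed sample definition that the script writes, are hypothetical stand-ins for the file that you exported.
+
[source,bash]
----
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the exported MachineSet definition, trimmed to
# the four fields that this procedure changes.
cat > worker-sample.json <<'EOF'
{
  "metadata": { "name": "preserve-dsoc12r4-ktjfc-worker-us-east-2a" },
  "spec": {
    "selector": {
      "matchLabels": {
        "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"
      }
    },
    "template": {
      "metadata": {
        "labels": {
          "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"
        }
      },
      "spec": { "providerSpec": { "value": { "instanceType": "m5.xlarge" } } }
    }
  }
}
EOF

new_name="preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a"
label="machine.openshift.io/cluster-api-machineset"

# Set the name, both machine set labels, and the instance type in one pass.
jq --arg name "$new_name" --arg label "$label" '
  .metadata.name = $name
  | .spec.selector.matchLabels[$label] = $name
  | .spec.template.metadata.labels[$label] = $name
  | .spec.template.spec.providerSpec.value.instanceType = "g4dn.xlarge"
' worker-sample.json > gpu-sample.json

# Print the edited name and instance type to confirm the result.
jq -r '.metadata.name, .spec.template.spec.providerSpec.value.instanceType' gpu-sample.json
----
+
To apply the same filter to a real export, skip the sample file and pass your `<output_file.json>` file to `jq` in place of `worker-sample.json`.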
. To verify your changes, perform a `diff` of the original compute machine set definition and the new GPU-enabled machine set definition by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machineset preserve-dsoc12r4-ktjfc-worker-us-east-2a -o json | diff preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json -
----
+
.Example output
+
[source,terminal]
----
10c10
< "name": "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a",
---
> "name": "preserve-dsoc12r4-ktjfc-worker-us-east-2a",
21c21
< "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a"
---
> "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"
31c31
< "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a"
---
> "machine.openshift.io/cluster-api-machineset": "preserve-dsoc12r4-ktjfc-worker-us-east-2a"
60c60
< "instanceType": "g4dn.xlarge",
---
> "instanceType": "m5.xlarge",
----
. Create the GPU-enabled compute machine set from the definition by running the following command:
+
[source,terminal]
----
$ oc create -f preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a.json
----
+
.Example output
+
[source,terminal]
----
machineset.machine.openshift.io/preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a created
----
.Verification
. View the machine set you created by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machinesets | grep gpu
----
+
The `MachineSet` replica count is set to `1`, so a new `Machine` object is created automatically.
+
.Example output
+
[source,terminal]
----
preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a 1 1 1 1 4m21s
----
. View the `Machine` object that the machine set created by running the following command:
+
[source,terminal]
----
$ oc -n openshift-machine-api get machines | grep gpu
----
+
.Example output
+
[source,terminal]
----
preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a Running g4dn.xlarge us-east-2 us-east-2a 4m36s
----
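+
The machine check in the previous step can be scripted when you want to poll until the GPU machine is ready. The following is a minimal sketch under stated assumptions: `gpu_machines_running` is a hypothetical helper name, and the function assumes that the phase is the second column of the `oc get machines` output, as in the example outputs in this procedure.
+
[source,bash]
----
#!/usr/bin/env bash

# Hypothetical helper: reads `oc get machines` output on stdin and succeeds
# only if at least one line contains "gpu" and every such line reports the
# Running phase in its second column (compared case-insensitively).
gpu_machines_running() {
  awk '/gpu/ { found++; if (tolower($2) != "running") pending++ }
       END   { exit (found > 0 && pending == 0) ? 0 : 1 }'
}

# Sample input copied from the example outputs in this procedure. On a live
# cluster, pipe `oc -n openshift-machine-api get machines` into the function.
sample='preserve-dsoc12r4-ktjfc-worker-us-east-2a-dts8r Running m5.xlarge us-east-2 us-east-2a 3d11h
preserve-dsoc12r4-ktjfc-worker-gpu-us-east-2a Running g4dn.xlarge us-east-2 us-east-2a 4m36s'

if printf '%s\n' "$sample" | gpu_machines_running; then
  echo "GPU machines are running"
else
  echo "GPU machines are not ready yet"
fi
----
+
The function exits with a nonzero status when no line contains `gpu`, so an empty or filtered-out listing is not mistaken for success.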
Note that you do not need to specify a namespace when you view nodes. The node definition is cluster scoped.