// Module included in the following assemblies:
//
// * operators/user/das-dynamic-accelerator-slicer-operator.adoc

:_mod-docs-content-type: PROCEDURE
[id="das-operator-troubleshooting_{context}"]
= Troubleshooting the Dynamic Accelerator Slicer Operator

If you experience issues with the Dynamic Accelerator Slicer (DAS) Operator, use the following troubleshooting steps to diagnose and resolve problems.

.Prerequisites

* You have installed the DAS Operator.
* You have access to the {product-title} cluster as a user with the `cluster-admin` role.
== Debugging DAS Operator components

.Procedure

. Check the status of all DAS Operator components by running the following command:
+
[source,terminal]
----
$ oc get pods -n das-operator
----
+
.Example output
[source,terminal]
----
NAME                                    READY   STATUS    RESTARTS   AGE
das-daemonset-6rsfd                     1/1     Running   0          5m16s
das-daemonset-8qzgf                     1/1     Running   0          5m16s
das-operator-5946478b47-cjfcp           1/1     Running   0          5m18s
das-operator-5946478b47-npwmn           1/1     Running   0          5m18s
das-operator-webhook-59949d4f85-5n9qt   1/1     Running   0          68s
das-operator-webhook-59949d4f85-nbtdl   1/1     Running   0          68s
das-scheduler-6cc59dbf96-4r85f          1/1     Running   0          68s
das-scheduler-6cc59dbf96-bf6ml          1/1     Running   0          68s
----

. Inspect the logs of the DAS Operator controller by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-operator
----
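+
If the controller has restarted, you can also review logs from the previous container instance. This is a general `oc logs` sketch; the `--previous` and `--tail` flags are standard options rather than DAS-specific behavior:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-operator --previous --tail=100
----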

. Check the logs of the webhook server by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-operator-webhook
----
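+
Webhook admission failures often surface as events in the namespace as well. As a supplementary check, you can sort recent events by timestamp; these are standard `oc` options, not DAS-specific behavior:
+
[source,terminal]
----
$ oc get events -n das-operator --sort-by=.lastTimestamp
----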

. Check the logs of the scheduler plugin by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-scheduler
----

. Check the logs of the device plugin daemon set by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator daemonset/das-daemonset
----
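+
Because `oc logs daemonset/<name>` returns logs from only one pod of the daemon set, you might need the pod running on a specific node. A sketch using a standard field selector, where `<node-name>` and the pod name are placeholders:
+
[source,terminal]
----
$ oc get pods -n das-operator -o wide --field-selector spec.nodeName=<node-name>
$ oc logs -n das-operator <das-daemonset-pod-name>
----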

== Monitoring AllocationClaims

.Procedure

. Inspect active `AllocationClaim` resources by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator
----
+
.Example output
[source,terminal]
----
NAME                                                                                           AGE
13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0   5m
ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0   5m
----

. View detailed information about `AllocationClaim` resources by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator -o yaml
----
+
.Example output (truncated)
[source,yaml]
----
apiVersion: inference.redhat.com/v1alpha1
kind: AllocationClaim
metadata:
  name: 13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0
  namespace: das-operator
spec:
  gpuUUID: GPU-9003fd9c-1ad1-c935-d8cd-d1ae69ef17c0
  migPlacement:
    size: 1
    start: 0
  nodename: harpatil000034jma-qh5fm-worker-f-57md9
  podRef:
    kind: Pod
    name: cuda-vectoradd-f4b84b678-l2m69
    namespace: default
    uid: 13950288-57df-4ab5-82bc-6138f646633e
  profile: 1g.5gb
status:
  conditions:
  - lastTransitionTime: "2025-08-06T19:28:48Z"
    message: Allocation is inUse
    reason: inUse
    status: "True"
    type: State
  state: inUse
----

. Check for claims in different states by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'
----
+
.Example output
[source,terminal]
----
13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0   inUse
ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0   inUse
----

. View events related to `AllocationClaim` resources by running the following command:
+
[source,terminal]
----
$ oc get events -n das-operator --field-selector involvedObject.kind=AllocationClaim
----

. Check `NodeAccelerator` resources to verify GPU hardware detection by running the following command:
+
[source,terminal]
----
$ oc get nodeaccelerator -n das-operator
----
+
.Example output
[source,terminal]
----
NAME                                     AGE
harpatil000034jma-qh5fm-worker-f-57md9   96m
harpatil000034jma-qh5fm-worker-f-fl4wg   96m
----
+
The `NodeAccelerator` resources represent the GPU-capable nodes detected by the DAS Operator.
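+
To inspect what the Operator detected on a particular node, you can dump a single resource; `<node-name>` is a placeholder for one of the names listed above:
+
[source,terminal]
----
$ oc get nodeaccelerator <node-name> -n das-operator -o yaml
----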

.Additional information

The `AllocationClaim` custom resource tracks the following information:

GPU UUID:: The unique identifier of the GPU device.
Slice position:: The position of the MIG slice on the GPU.
Pod reference:: The pod that requested the GPU slice.
State:: The current state of the claim (`staged`, `created`, or `released`).

Claims start in the `staged` state and transition to `created` when all requests are satisfied. When a pod is deleted, the associated claim is automatically cleaned up.
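
As a quick check against this state model, you can list only the claims that have not yet left the `staged` state. This jsonpath filter is a sketch that relies on the `status.state` field shown in the earlier example output:

[source,terminal]
----
$ oc get allocationclaims -n das-operator -o jsonpath='{range .items[?(@.status.state=="staged")]}{.metadata.name}{"\n"}{end}'
----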

== Verifying GPU device availability

.Procedure

. On a node with GPU hardware, verify that CDI devices were created by running the following commands:
+
[source,terminal]
----
$ oc debug node/<node-name>
----
+
[source,terminal]
----
sh-4.4# chroot /host
sh-4.4# ls -l /var/run/cdi/
----

. Check the NVIDIA GPU Operator status by running the following command:
+
[source,terminal]
----
$ oc get clusterpolicies.nvidia.com -o jsonpath='{.items[0].status.state}'
----
+
The output should show `ready`.
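+
If the status is not `ready`, inspecting the NVIDIA GPU Operator pods can help locate the failing component. The `nvidia-gpu-operator` namespace is the common default install location, but this is an assumption and might differ in your cluster:
+
[source,terminal]
----
$ oc get pods -n nvidia-gpu-operator
----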

== Increasing log verbosity

To get more detailed debugging information, increase the log verbosity of the DAS Operator components.

.Procedure

. Edit the `DASOperator` resource to increase log verbosity by running the following command:
+
[source,terminal]
----
$ oc edit dasoperator -n das-operator
----

. Set the `operatorLogLevel` field to `Debug` or `Trace`:
+
[source,yaml]
----
spec:
  operatorLogLevel: Debug
----
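+
If you prefer a non-interactive change, a merge patch accomplishes the same edit. This is a sketch; `<name>` is a placeholder for your `DASOperator` resource name:
+
[source,terminal]
----
$ oc patch dasoperator <name> -n das-operator --type=merge -p '{"spec":{"operatorLogLevel":"Debug"}}'
----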

. Save the changes and verify that the operator pods restart with increased verbosity.
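+
One way to confirm the rollout is to watch the pods in the namespace until the new replicas report `Running`:
+
[source,terminal]
----
$ oc get pods -n das-operator -w
----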

== Common issues and solutions

.Pods stuck in UnexpectedAdmissionError state
[NOTE]
====
Due to link:https://github.com/kubernetes/kubernetes/issues/128043[kubernetes/kubernetes#128043], pods might enter an `UnexpectedAdmissionError` state if admission fails. Pods managed by higher-level controllers such as Deployments are recreated automatically. Naked pods, however, must be cleaned up manually with `oc delete pod`. Using controllers is recommended until the upstream issue is resolved.
====
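
To find naked pods stuck in this state, you can list failed pods and delete them manually. The field selector is a standard `oc get` option; `<pod-name>` and `<namespace>` are placeholders:

[source,terminal]
----
$ oc get pods --all-namespaces --field-selector=status.phase=Failed
$ oc delete pod <pod-name> -n <namespace>
----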

.Prerequisites not met
If the DAS Operator fails to start or function properly, verify that all prerequisites are installed:

* cert-manager
* Node Feature Discovery (NFD) Operator
* NVIDIA GPU Operator
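
A quick way to confirm that these components are running is to check their pods. The namespace names below are common defaults and are assumptions that might differ in your cluster:

[source,terminal]
----
$ oc get pods -n cert-manager
$ oc get pods -n openshift-nfd
$ oc get pods -n nvidia-gpu-operator
----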