// Module included in the following assemblies:
//
// * operators/user/das-dynamic-accelerator-slicer-operator.adoc
//
:_mod-docs-content-type: PROCEDURE
[id="das-operator-troubleshooting_{context}"]
= Troubleshooting the Dynamic Accelerator Slicer Operator

If you experience issues with the Dynamic Accelerator Slicer (DAS) Operator, use the following troubleshooting steps to diagnose and resolve problems.

.Prerequisites
* You have installed the DAS Operator.
* You have access to the {product-title} cluster as a user with the `cluster-admin` role.

== Debugging DAS Operator components

.Procedure
. Check the status of all DAS Operator components by running the following command:
+
[source,terminal]
----
$ oc get pods -n das-operator
----
+
.Example output
[source,terminal]
----
NAME                                    READY   STATUS    RESTARTS   AGE
das-daemonset-6rsfd                     1/1     Running   0          5m16s
das-daemonset-8qzgf                     1/1     Running   0          5m16s
das-operator-5946478b47-cjfcp           1/1     Running   0          5m18s
das-operator-5946478b47-npwmn           1/1     Running   0          5m18s
das-operator-webhook-59949d4f85-5n9qt   1/1     Running   0          68s
das-operator-webhook-59949d4f85-nbtdl   1/1     Running   0          68s
das-scheduler-6cc59dbf96-4r85f          1/1     Running   0          68s
das-scheduler-6cc59dbf96-bf6ml          1/1     Running   0          68s
----
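+
If any pod is not in the `Running` state, you can list only the pods that are failing. The following filter is generic `oc` usage and is not specific to the DAS Operator:
+
[source,terminal]
----
$ oc get pods -n das-operator --field-selector=status.phase!=Running
----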
. Inspect the logs of the DAS Operator controller by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-operator
----
. Check the logs of the webhook server by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-operator-webhook
----
. Check the logs of the scheduler plugin by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-scheduler
----
. Check the logs of the device plugin daemonset by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator daemonset/das-daemonset
----
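+
Because the device plugin runs as a daemon set with one pod for each GPU node, you might need the logs of the pod on a specific node. The following filter is generic `oc` usage; replace `<node-name>` with your node name:
+
[source,terminal]
----
$ oc get pods -n das-operator -o wide --field-selector spec.nodeName=<node-name>
----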

== Monitoring AllocationClaims

.Procedure
. Inspect active `AllocationClaim` resources by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator
----
+
.Example output
[source,terminal]
----
NAME AGE
13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0 5m
ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0 5m
----
. View detailed information about the `AllocationClaim` resources by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator -o yaml
----
+
.Example output (truncated)
[source,yaml]
----
apiVersion: inference.redhat.com/v1alpha1
kind: AllocationClaim
metadata:
  name: 13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0
  namespace: das-operator
spec:
  gpuUUID: GPU-9003fd9c-1ad1-c935-d8cd-d1ae69ef17c0
  migPlacement:
    size: 1
    start: 0
  nodename: harpatil000034jma-qh5fm-worker-f-57md9
  podRef:
    kind: Pod
    name: cuda-vectoradd-f4b84b678-l2m69
    namespace: default
    uid: 13950288-57df-4ab5-82bc-6138f646633e
  profile: 1g.5gb
status:
  conditions:
  - lastTransitionTime: "2025-08-06T19:28:48Z"
    message: Allocation is inUse
    reason: inUse
    status: "True"
    type: State
  state: inUse
----
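+
To inspect a single claim, pass its name from the previous listing. The `oc describe` command is a generic alternative that also shows recent events for the object:
+
[source,terminal]
----
$ oc describe allocationclaims <claim-name> -n das-operator
----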
. Check for claims in different states by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'
----
+
.Example output
[source,terminal]
----
13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0 inUse
ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0 inUse
----
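+
To summarize how many claims are in each state, you can pipe the state values through standard shell tools. This is a general shell sketch rather than a DAS-specific command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator \
  -o jsonpath='{range .items[*]}{.status.state}{"\n"}{end}' | sort | uniq -c
----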
. View events related to `AllocationClaim` resources by running the following command:
+
[source,terminal]
----
$ oc get events -n das-operator --field-selector involvedObject.kind=AllocationClaim
----
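+
If you are investigating a recent failure, sorting the events by timestamp can help. The `--sort-by` flag is standard `oc get` behavior:
+
[source,terminal]
----
$ oc get events -n das-operator --field-selector involvedObject.kind=AllocationClaim --sort-by=.lastTimestamp
----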
. Check `NodeAccelerator` resources to verify GPU hardware detection by running the following command:
+
[source,terminal]
----
$ oc get nodeaccelerator -n das-operator
----
+
.Example output
[source,terminal]
----
NAME                                     AGE
harpatil000034jma-qh5fm-worker-f-57md9   96m
harpatil000034jma-qh5fm-worker-f-fl4wg   96m
----
+
The `NodeAccelerator` resources represent the GPU-capable nodes detected by the DAS Operator.
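+
To review what the DAS Operator detected on a particular node, inspect the corresponding resource. The resource names match the node names shown in the previous output; the exact fields in the output depend on the installed Operator version:
+
[source,terminal]
----
$ oc get nodeaccelerator <node-name> -n das-operator -o yaml
----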

.Additional information
The `AllocationClaim` custom resource tracks the following information:

GPU UUID:: The unique identifier of the GPU device.
Slice position:: The position of the MIG slice on the GPU.
Pod reference:: The pod that requested the GPU slice.
State:: The current state of the claim (`staged`, `created`, or `released`).

Claims start in the `staged` state and transition to `created` when all requests are satisfied. When a pod is deleted, the associated claim is automatically cleaned up.
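
To observe state transitions as they happen, you can watch the resources. The `-w` flag is standard `oc get` behavior:

[source,terminal]
----
$ oc get allocationclaims -n das-operator -w
----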

== Verifying GPU device availability

.Procedure
. On a node with GPU hardware, verify that CDI devices were created by running the following commands:
+
[source,terminal]
----
$ oc debug node/<node-name>
----
+
[source,terminal]
----
sh-4.4# chroot /host
sh-4.4# ls -l /var/run/cdi/
----
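+
To inspect the contents of a CDI specification, view one of the listed files. CDI specifications are typically JSON or YAML files; use a file name from the listing:
+
[source,terminal]
----
sh-4.4# cat /var/run/cdi/<file-name>
----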
. Check the NVIDIA GPU Operator status by running the following command:
+
[source,terminal]
----
$ oc get clusterpolicies.nvidia.com -o jsonpath='{.items[0].status.state}'
----
+
The output should show `ready`.
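+
If the state is not `ready`, check the NVIDIA GPU Operator pods directly. The following command assumes the default `nvidia-gpu-operator` namespace; adjust it if your installation uses a different namespace:
+
[source,terminal]
----
$ oc get pods -n nvidia-gpu-operator
----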

== Increasing log verbosity

To get more detailed debugging information, increase the log verbosity of the `DASOperator` resource.

.Procedure
. Edit the `DASOperator` resource to increase log verbosity by running the following command:
+
[source,terminal]
----
$ oc edit dasoperator -n das-operator
----
. Set the `operatorLogLevel` field to `Debug` or `Trace`:
+
[source,yaml]
----
spec:
  operatorLogLevel: Debug
----
. Save the changes and verify that the operator pods restart with increased verbosity.
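+
To confirm that the rollout completed, you can watch the deployment status. This is standard `oc` usage:
+
[source,terminal]
----
$ oc rollout status deployment/das-operator -n das-operator
----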

== Common issues and solutions

.Pods stuck in the `UnexpectedAdmissionError` state
[NOTE]
====
Due to link:https://github.com/kubernetes/kubernetes/issues/128043[kubernetes/kubernetes#128043], pods might enter an `UnexpectedAdmissionError` state if admission fails. Pods that are managed by higher-level controllers, such as Deployments, are recreated automatically. Bare pods, however, must be cleaned up manually with `oc delete pod`. Until the upstream issue is resolved, running workloads through higher-level controllers is recommended.
====
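
Pods in this state are typically reported with a `Failed` phase, so you can locate and delete them with generic `oc` commands. The pod name and namespace below are placeholders:

[source,terminal]
----
$ oc get pods -A --field-selector=status.phase=Failed
$ oc delete pod <pod-name> -n <namespace>
----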

.Prerequisites not met
If the DAS Operator fails to start or function properly, verify that all of the following prerequisites are installed:

* cert-manager
* Node Feature Discovery (NFD) Operator
* NVIDIA GPU Operator
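
To verify that the prerequisites are running, check their pods. The following namespaces are the typical defaults and might differ in your cluster:

[source,terminal]
----
$ oc get pods -n cert-manager
$ oc get pods -n openshift-nfd
$ oc get pods -n nvidia-gpu-operator
----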