// Module included in the following assemblies:
//
// * operators/user/das-dynamic-accelerator-slicer-operator.adoc
//
:_mod-docs-content-type: PROCEDURE
[id="das-operator-troubleshooting_{context}"]
= Troubleshooting the Dynamic Accelerator Slicer Operator

If you experience issues with the Dynamic Accelerator Slicer (DAS) Operator, use the following troubleshooting steps to diagnose and resolve problems.

.Prerequisites
* You have installed the DAS Operator.
* You have access to the {product-title} cluster as a user with the `cluster-admin` role.

== Debugging DAS Operator components

.Procedure
. Check the status of all DAS Operator components by running the following command:
+
[source,terminal]
----
$ oc get pods -n das-operator
----
+
.Example output
[source,terminal]
----
NAME                                    READY   STATUS    RESTARTS   AGE
das-daemonset-6rsfd                     1/1     Running   0          5m16s
das-daemonset-8qzgf                     1/1     Running   0          5m16s
das-operator-5946478b47-cjfcp           1/1     Running   0          5m18s
das-operator-5946478b47-npwmn           1/1     Running   0          5m18s
das-operator-webhook-59949d4f85-5n9qt   1/1     Running   0          68s
das-operator-webhook-59949d4f85-nbtdl   1/1     Running   0          68s
das-scheduler-6cc59dbf96-4r85f          1/1     Running   0          68s
das-scheduler-6cc59dbf96-bf6ml          1/1     Running   0          68s
----
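+
If any pod is not in the `Running` state, you can list only the pods that are failing. The following filter is generic `oc` usage and is not specific to the DAS Operator:
+
[source,terminal]
----
$ oc get pods -n das-operator --field-selector=status.phase!=Running
----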
. Inspect the logs of the DAS Operator controller by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-operator
----
. Check the logs of the webhook server by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-operator-webhook
----
. Check the logs of the scheduler plugin by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator deployment/das-scheduler
----
. Check the logs of the device plugin daemonset by running the following command:
+
[source,terminal]
----
$ oc logs -n das-operator daemonset/das-daemonset
----
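+
Because the device plugin runs as a daemon set with one pod for each GPU node, you might need the logs of the pod on a specific node. The following filter is generic `oc` usage; replace `<node-name>` with your node name:
+
[source,terminal]
----
$ oc get pods -n das-operator -o wide --field-selector spec.nodeName=<node-name>
----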

== Monitoring AllocationClaims

.Procedure
. Inspect active `AllocationClaim` resources by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator
----
+
.Example output
[source,terminal]
----
NAME AGE
13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0 5m
ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0 5m
----
. View detailed information about the `AllocationClaim` resources by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator -o yaml
----
+
.Example output (truncated)
[source,yaml]
----
apiVersion: inference.redhat.com/v1alpha1
kind: AllocationClaim
metadata:
  name: 13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0
  namespace: das-operator
spec:
  gpuUUID: GPU-9003fd9c-1ad1-c935-d8cd-d1ae69ef17c0
  migPlacement:
    size: 1
    start: 0
  nodename: harpatil000034jma-qh5fm-worker-f-57md9
  podRef:
    kind: Pod
    name: cuda-vectoradd-f4b84b678-l2m69
    namespace: default
    uid: 13950288-57df-4ab5-82bc-6138f646633e
  profile: 1g.5gb
status:
  conditions:
  - lastTransitionTime: "2025-08-06T19:28:48Z"
    message: Allocation is inUse
    reason: inUse
    status: "True"
    type: State
  state: inUse
----
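+
To inspect a single claim, pass its name from the previous listing. The `oc describe` command is a generic alternative that also shows recent events for the object:
+
[source,terminal]
----
$ oc describe allocationclaims <claim-name> -n das-operator
----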
. Check for claims in different states by running the following command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.state}{"\n"}{end}'
----
+
.Example output
[source,terminal]
----
13950288-57df-4ab5-82bc-6138f646633e-harpatil000034jma-qh5fm-worker-f-57md9-cuda-vectoradd-0 inUse
ce997b60-a0b8-4ea4-9107-cf59b425d049-harpatil000034jma-qh5fm-worker-f-fl4wg-cuda-vectoradd-0 inUse
----
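+
To summarize how many claims are in each state, you can pipe the state values through standard shell tools. This is a general shell sketch rather than a DAS-specific command:
+
[source,terminal]
----
$ oc get allocationclaims -n das-operator \
  -o jsonpath='{range .items[*]}{.status.state}{"\n"}{end}' | sort | uniq -c
----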
. View events related to `AllocationClaim` resources by running the following command:
+
[source,terminal]
----
$ oc get events -n das-operator --field-selector involvedObject.kind=AllocationClaim
----
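+
If you are investigating a recent failure, sorting the events by timestamp can help. The `--sort-by` flag is standard `oc get` behavior:
+
[source,terminal]
----
$ oc get events -n das-operator --field-selector involvedObject.kind=AllocationClaim --sort-by=.lastTimestamp
----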
. Check `NodeAccelerator` resources to verify GPU hardware detection by running the following command:
+
[source,terminal]
----
$ oc get nodeaccelerator -n das-operator
----
+
.Example output
[source,terminal]
----
NAME                                     AGE
harpatil000034jma-qh5fm-worker-f-57md9   96m
harpatil000034jma-qh5fm-worker-f-fl4wg   96m
----
+
The `NodeAccelerator` resources represent the GPU-capable nodes detected by the DAS Operator.
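+
To review what the DAS Operator detected on a particular node, inspect the corresponding resource. The resource names match the node names shown in the previous output; the exact fields in the output depend on the installed Operator version:
+
[source,terminal]
----
$ oc get nodeaccelerator <node-name> -n das-operator -o yaml
----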

.Additional information
The `AllocationClaim` custom resource tracks the following information:

GPU UUID:: The unique identifier of the GPU device.
Slice position:: The position of the MIG slice on the GPU.
Pod reference:: The pod that requested the GPU slice.
State:: The current state of the claim (`staged`, `created`, or `released`).

Claims start in the `staged` state and transition to `created` when all requests are satisfied. When a pod is deleted, the associated claim is automatically cleaned up.
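
To observe state transitions as they happen, you can watch the resources. The `-w` flag is standard `oc get` behavior:

[source,terminal]
----
$ oc get allocationclaims -n das-operator -w
----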

== Verifying GPU device availability

.Procedure
. On a node with GPU hardware, verify that CDI devices were created by running the following commands:
+
[source,terminal]
----
$ oc debug node/<node-name>
----
+
[source,terminal]
----
sh-4.4# chroot /host
sh-4.4# ls -l /var/run/cdi/
----
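+
To inspect the contents of a CDI specification, view one of the listed files. CDI specifications are typically JSON or YAML files; use a file name from the listing:
+
[source,terminal]
----
sh-4.4# cat /var/run/cdi/<file-name>
----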
. Check the NVIDIA GPU Operator status by running the following command:
+
[source,terminal]
----
$ oc get clusterpolicies.nvidia.com -o jsonpath='{.items[0].status.state}'
----
+
The output should show `ready`.
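+
If the state is not `ready`, check the NVIDIA GPU Operator pods directly. The following command assumes the default `nvidia-gpu-operator` namespace; adjust it if your installation uses a different namespace:
+
[source,terminal]
----
$ oc get pods -n nvidia-gpu-operator
----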

== Increasing log verbosity

To get more detailed debugging information, increase the log verbosity of the `DASOperator` resource.

.Procedure
. Edit the `DASOperator` resource to increase log verbosity by running the following command:
+
[source,terminal]
----
$ oc edit dasoperator -n das-operator
----
. Set the `operatorLogLevel` field to `Debug` or `Trace`:
+
[source,yaml]
----
spec:
  operatorLogLevel: Debug
----
. Save the changes and verify that the operator pods restart with increased verbosity.
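+
To confirm that the rollout completed, you can watch the deployment status. This is standard `oc` usage:
+
[source,terminal]
----
$ oc rollout status deployment/das-operator -n das-operator
----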

== Common issues and solutions

.Pods stuck in the `UnexpectedAdmissionError` state
[NOTE]
====
Due to link:https://github.com/kubernetes/kubernetes/issues/128043[kubernetes/kubernetes#128043], pods might enter an `UnexpectedAdmissionError` state if admission fails. Pods that are managed by higher-level controllers, such as Deployments, are recreated automatically. Bare pods, however, must be cleaned up manually with `oc delete pod`. Until the upstream issue is resolved, running workloads through higher-level controllers is recommended.
====
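
Pods in this state are typically reported with a `Failed` phase, so you can locate and delete them with generic `oc` commands. The pod name and namespace below are placeholders:

[source,terminal]
----
$ oc get pods -A --field-selector=status.phase=Failed
$ oc delete pod <pod-name> -n <namespace>
----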

.Prerequisites not met
If the DAS Operator fails to start or function properly, verify that all of the following prerequisites are installed:

* cert-manager
* Node Feature Discovery (NFD) Operator
* NVIDIA GPU Operator
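
To verify that the prerequisites are running, check their pods. The following namespaces are the typical defaults and might differ in your cluster:

[source,terminal]
----
$ oc get pods -n cert-manager
$ oc get pods -n openshift-nfd
$ oc get pods -n nvidia-gpu-operator
----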