// Module included in the following assemblies:
//
// * edge_computing/cnf-talm-for-cluster-upgrades.adoc
:_mod-docs-content-type: PROCEDURE
[id="talo-troubleshooting_{context}"]
= Troubleshooting the {cgu-operator-full}
The {cgu-operator-first} is an {product-title} Operator that remediates {rh-rhacm} policies. When issues occur, use the `oc adm must-gather` command to gather details and logs to help you debug the issues.
For more information about related topics, see the following documentation:
* link:https://access.redhat.com/articles/6218901[Red Hat Advanced Cluster Management for Kubernetes 2.4 Support Matrix]
* link:https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.0/html/troubleshooting/troubleshooting[Red Hat Advanced Cluster Management Troubleshooting]
* The "Troubleshooting Operator issues" section
[id="talo-general-troubleshooting_{context}"]
== General troubleshooting
You can determine the cause of the problem by reviewing the following questions:
* Is the configuration that you are applying supported?
** Are the {rh-rhacm} and the {product-title} versions compatible?
** Are the {cgu-operator} and {rh-rhacm} versions compatible?
* Which of the following components is causing the problem?
** <<talo-troubleshooting-managed-policies_{context}>>
** <<talo-troubleshooting-clusters_{context}>>
** <<talo-troubleshooting-remediation-strategy_{context}>>
** <<talo-troubleshooting-remediation-talo_{context}>>
To ensure that the `ClusterGroupUpgrade` configuration is functional, you can do the following:
. Create the `ClusterGroupUpgrade` CR with the `spec.enable` field set to `false`.
. Wait for the status to be updated and go through the troubleshooting questions.
. If everything looks as expected, set the `spec.enable` field to `true` in the `ClusterGroupUpgrade` CR.
[WARNING]
====
After you set the `spec.enable` field to `true` in the `ClusterGroupUpgrade` CR, the update procedure starts and you cannot edit the CR's `spec` fields anymore.
====
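The following example shows a minimal `ClusterGroupUpgrade` CR created with the `spec.enable` field set to `false`, as described in step 1 of the preceding procedure. The cluster and policy names are illustrative placeholders taken from the examples in this module; substitute your own values.
.Example `ClusterGroupUpgrade` CR with the update disabled
[source,yaml]
----
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: lab-upgrade
  namespace: default
spec:
  clusters:
  - spoke1
  - spoke3
  enable: false
  managedPolicies:
  - policy1-common-cluster-version-policy
  - policy2-common-nto-sub-policy
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240
----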
[id="talo-troubleshooting-modify-cgu_{context}"]
== Cannot modify the ClusterGroupUpgrade CR
Issue:: You cannot edit the `ClusterGroupUpgrade` CR after enabling the update.
Resolution:: Restart the procedure by performing the following steps:
+
. Remove the old `ClusterGroupUpgrade` CR by running the following command:
+
[source,terminal]
----
$ oc delete cgu -n <ClusterGroupUpgradeCR_namespace> <ClusterGroupUpgradeCR_name>
----
+
. Check and fix the existing issues with the managed clusters and policies.
.. Ensure that all the clusters are managed clusters and available.
.. Ensure that all the policies exist and have the `spec.remediationAction` field set to `inform`.
+
. Create a new `ClusterGroupUpgrade` CR with the correct configurations.
+
[source,terminal]
----
$ oc apply -f <ClusterGroupUpgradeCR_YAML>
----
[id="talo-troubleshooting-managed-policies_{context}"]
== Managed policies
[discrete]
=== Checking managed policies on the system
Issue:: You want to check if you have the correct managed policies on the system.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.managedPolicies}'
----
+
.Example output
+
[source,json]
----
["group-du-sno-validator-du-validator-policy", "policy2-common-nto-sub-policy", "policy3-common-ptp-sub-policy"]
----
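+
To confirm that each policy in the list exists on the hub cluster, you can filter the policy list by name. The following command is an example only and uses a policy name from the preceding output:
+
[source,terminal]
----
$ oc get policies --all-namespaces | grep policy2-common-nto-sub-policy
----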
[discrete]
=== Checking remediationAction mode
Issue:: You want to check if the `remediationAction` field is set to `inform` in the `spec` of the managed policies.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get policies --all-namespaces
----
+
.Example output
+
[source,terminal]
----
NAMESPACE   NAME                                     REMEDIATION ACTION   COMPLIANCE STATE   AGE
default     policy1-common-cluster-version-policy    inform               NonCompliant       5d21h
default     policy2-common-nto-sub-policy            inform               Compliant          5d21h
default     policy3-common-ptp-sub-policy            inform               NonCompliant       5d21h
default     policy4-common-sriov-sub-policy          inform               NonCompliant       5d21h
----
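+
To check the `remediationAction` mode of a single policy, you can query the field directly. The following command is an example only and assumes that the policy is in the `default` namespace, as in the preceding output:
+
[source,terminal]
----
$ oc get policy policy3-common-ptp-sub-policy -n default -ojsonpath='{.spec.remediationAction}'
----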
[discrete]
=== Checking policy compliance state
Issue:: You want to check the compliance state of policies.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get policies --all-namespaces
----
+
.Example output
+
[source,terminal]
----
NAMESPACE   NAME                                     REMEDIATION ACTION   COMPLIANCE STATE   AGE
default     policy1-common-cluster-version-policy    inform               NonCompliant       5d21h
default     policy2-common-nto-sub-policy            inform               Compliant          5d21h
default     policy3-common-ptp-sub-policy            inform               NonCompliant       5d21h
default     policy4-common-sriov-sub-policy          inform               NonCompliant       5d21h
----
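+
To check the compliance state of a single policy, you can query its `status.compliant` field directly. The following command is an example only and assumes that the policy is in the `default` namespace, as in the preceding output:
+
[source,terminal]
----
$ oc get policy policy3-common-ptp-sub-policy -n default -ojsonpath='{.status.compliant}'
----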
[id="talo-troubleshooting-clusters_{context}"]
== Clusters
[discrete]
=== Checking if managed clusters are present
Issue:: You want to check if the clusters in the `ClusterGroupUpgrade` CR are managed clusters.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get managedclusters
----
+
.Example output
+
[source,terminal]
----
NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
local-cluster   true           https://api.hub.example.com:6443      True     Unknown     13d
spoke1          true           https://api.spoke1.example.com:6443   True     True        13d
spoke3          true           https://api.spoke3.example.com:6443   True     True        27h
----
. Alternatively, check the {cgu-operator} manager logs:
.. Get the name of the {cgu-operator} manager by running the following command:
+
[source,terminal]
----
$ oc get pod -n openshift-operators
----
+
.Example output
+
[source,terminal]
----
NAME                                                          READY   STATUS    RESTARTS   AGE
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp   2/2     Running   0          45m
----
.. Check the {cgu-operator} manager logs by running the following command:
+
[source,terminal]
----
$ oc logs -n openshift-operators \
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
----
+
.Example output
+
[source,terminal]
----
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} <1>
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
----
<1> The error message shows that the cluster is not a managed cluster.
[discrete]
=== Checking if managed clusters are available
Issue:: You want to check if the managed clusters specified in the `ClusterGroupUpgrade` CR are available.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get managedclusters
----
+
.Example output
+
[source,terminal]
----
NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
local-cluster   true           https://api.hub.testlab.com:6443      True     Unknown     13d
spoke1          true           https://api.spoke1.testlab.com:6443   True     True        13d <1>
spoke3          true           https://api.spoke3.testlab.com:6443   True     True        27h <1>
----
<1> The value of the `AVAILABLE` field is `True` for the managed clusters.
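To check the availability of a single managed cluster, you can query its `ManagedClusterConditionAvailable` condition. The following command is an example only and assumes a managed cluster named `spoke1`, as in the preceding output:
[source,terminal]
----
$ oc get managedcluster spoke1 -ojsonpath='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}'
----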
[discrete]
=== Checking clusterLabelSelector
Issue:: You want to check if the `clusterLabelSelectors` field specified in the `ClusterGroupUpgrade` CR matches at least one of the managed clusters.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get managedcluster --selector=upgrade=true <1>
----
<1> The label for the clusters you want to update is `upgrade:true`.
+
.Example output
+
[source,terminal]
----
NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h
----
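+
If a cluster that you expect to be updated is missing from the output, the selector label might not be set on that cluster. As an example, the following command adds the `upgrade=true` label to a managed cluster named `spoke1`:
+
[source,terminal]
----
$ oc label managedcluster spoke1 upgrade=true
----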
[discrete]
=== Checking if canary clusters are present
Issue:: You want to check if the canary clusters are present in the list of clusters.
+
.Example `ClusterGroupUpgrade` CR
[source,yaml]
----
spec:
  remediationStrategy:
    canaries:
    - spoke3
    maxConcurrency: 2
    timeout: 240
  clusterLabelSelectors:
  - matchLabels:
      upgrade: true
Resolution:: Run the following commands:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'
----
+
.Example output
+
[source,json]
----
["spoke1", "spoke3"]
----
. Check if the canary clusters are present in the list of clusters that match the `clusterLabelSelectors` labels by running the following command:
+
[source,terminal]
----
$ oc get managedcluster --selector=upgrade=true
----
+
.Example output
+
[source,terminal]
----
NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h
----
[NOTE]
====
A cluster can be present in `spec.clusters` and also be matched by the `spec.clusterLabelSelectors` labels.
====
[discrete]
=== Checking the pre-caching status on spoke clusters
. Check the status of pre-caching by running the following command on the spoke cluster:
+
[source,terminal]
----
$ oc get jobs,pods -n openshift-talo-pre-cache
----
[id="talo-troubleshooting-remediation-strategy_{context}"]
== Remediation strategy
[discrete]
=== Checking if remediationStrategy is present in the ClusterGroupUpgrade CR
Issue:: You want to check if the `remediationStrategy` is present in the `ClusterGroupUpgrade` CR.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy}'
----
+
.Example output
+
[source,json]
----
{"maxConcurrency":2, "timeout":240}
----
[discrete]
=== Checking if maxConcurrency is specified in the ClusterGroupUpgrade CR
Issue:: You want to check if the `maxConcurrency` is specified in the `ClusterGroupUpgrade` CR.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy.maxConcurrency}'
----
+
.Example output
+
[source,terminal]
----
2
----
[id="talo-troubleshooting-remediation-talo_{context}"]
== {cgu-operator-full}
[discrete]
=== Checking condition message and status in the ClusterGroupUpgrade CR
Issue:: You want to check the value of the `status.conditions` field in the `ClusterGroupUpgrade` CR.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'
----
+
.Example output
+
[source,json]
----
{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}
----
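+
To extract only the message of a specific condition type, such as the `Validated` condition shown in the preceding output, you can filter with a JSONPath expression. For example:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.status.conditions[?(@.type=="Validated")].message}'
----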
[discrete]
=== Checking if status.remediationPlan was computed
Issue:: You want to check if `status.remediationPlan` is computed.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.status.remediationPlan}'
----
+
.Example output
+
[source,json]
----
[["spoke2", "spoke3"]]
----
[discrete]
=== Errors in the {cgu-operator} manager container
Issue:: You want to check the logs of the manager container of {cgu-operator}.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc logs -n openshift-operators \
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
----
+
.Example output
+
[source,terminal]
----
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} <1>
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
----
<1> Displays the error.
[discrete]
=== Clusters are not compliant with some policies after a `ClusterGroupUpgrade` CR has completed
Issue:: The policy compliance status that {cgu-operator} uses to decide if remediation is needed has not yet fully updated for all clusters.
This might be because:
+
* The `ClusterGroupUpgrade` CR was run too soon after a policy was created or updated.
* The remediation of a policy affects the compliance of subsequent policies in the `ClusterGroupUpgrade` CR.
Resolution:: Create and apply a new `ClusterGroupUpgrade` CR with the same specification.
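+
For example, assuming that the completed CR is named `lab-upgrade` in the `default` namespace and that its specification is saved in `<ClusterGroupUpgradeCR_YAML>`, you can run the following commands:
+
[source,terminal]
----
$ oc delete cgu lab-upgrade -n default
$ oc apply -f <ClusterGroupUpgradeCR_YAML>
----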
[discrete]
[id="talo-troubleshooting-auto-create-policies_{context}"]
=== Auto-created `ClusterGroupUpgrade` CR in the {ztp} workflow has no managed policies
Issue:: If there are no policies for the managed cluster when the cluster becomes `Ready`, a `ClusterGroupUpgrade` CR with no policies is auto-created.
Upon completion of the `ClusterGroupUpgrade` CR, the managed cluster is labeled as `ztp-done`.
If the `PolicyGenerator` or `PolicyGenTemplate` CRs were not pushed to the Git repository within the required time after `ClusterInstance` resources were pushed, this might result in no policies being available for the target cluster when the cluster became `Ready`.
Resolution:: Verify that the policies you want to apply are available on the hub cluster, then create a `ClusterGroupUpgrade` CR with the required policies.
You can either manually create the `ClusterGroupUpgrade` CR or trigger auto-creation again. To trigger auto-creation of the `ClusterGroupUpgrade` CR, remove the `ztp-done` label from the cluster and delete the empty `ClusterGroupUpgrade` CR that was previously created in the `ztp-install` namespace.
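+
For example, assuming a managed cluster named `spoke1` and an auto-created `ClusterGroupUpgrade` CR that uses the cluster name, you can run the following commands:
+
[source,terminal]
----
$ oc label managedcluster spoke1 ztp-done-
$ oc delete cgu spoke1 -n ztp-install
----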
[discrete]
[id="talo-troubleshooting-pre-cache-failed_{context}"]
=== Pre-caching has failed
Issue:: Pre-caching might fail for one of the following reasons:
* There is not enough free space on the node.
* For a disconnected environment, the pre-cache image has not been properly mirrored.
* There was an issue when creating the pod.
Resolution::
. To check if pre-caching has failed due to insufficient space, check the logs of the pre-caching pod on the node.
.. Find the name of the pod using the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-talo-pre-cache
----
+
.. Check the logs to see if the error is related to insufficient space using the following command:
+
[source,terminal]
----
$ oc logs -n openshift-talo-pre-cache <pod name>
----
+
. If there is no log, check the pod status using the following command:
+
[source,terminal]
----
$ oc describe pod -n openshift-talo-pre-cache <pod name>
----
+
. If the pod does not exist, check the job status to see why it could not create a pod using the following command:
+
[source,terminal]
----
$ oc describe job -n openshift-talo-pre-cache pre-cache
----
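+
. If the logs indicate that the node is running out of space, check the available disk space on the affected node. The following command is an example only and assumes that container images are stored under `/var/lib/containers`:
+
[source,terminal]
----
$ oc debug node/<node_name> -- chroot /host df -h /var/lib/containers
----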
[discrete]
[id="talo-troubleshooting-pre-placement-tolerations_{context}"]
=== Matching policies and `ManagedCluster` CRs before the managed cluster is available
Issue:: You want {rh-rhacm} to match policies and managed clusters before the managed clusters become available.
Resolution:: To ensure that {cgu-operator} correctly applies the {rh-rhacm} policies specified in the `spec.managedPolicies` field of the `ClusterGroupUpgrade` (CGU) CR, {cgu-operator} needs to match these policies to the managed cluster before the managed cluster is available.
The {rh-rhacm} `PolicyGenerator` uses the generated `Placement` CR to do this automatically.
By default, this `Placement` CR includes the necessary tolerations to ensure proper {cgu-operator} behavior.
+
The expected `spec.tolerations` settings in the `Placement` CR are as follows:
+
[source,yaml]
----
# ...
tolerations:
- key: cluster.open-cluster-management.io/unavailable
  operator: Exists
- key: cluster.open-cluster-management.io/unreachable
  operator: Exists
# ...
----
+
If you use a custom `Placement` CR instead of the one generated by the {rh-rhacm} `PolicyGenerator`, include these tolerations in that `Placement` CR.
+
For more information on placements in {rh-rhacm}, see link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/clusters/index#placement-overview[Placement overview].
+
For more information on tolerations in {rh-rhacm}, see link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/latest/html-single/clusters/index#taints-tolerations-managed[Placing managed clusters by using taints and tolerations].
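+
If you create your own `Placement` CR, the following minimal example shows where the tolerations fit. The name, namespace, cluster set, and label selector are placeholders for illustration only:
+
[source,yaml]
----
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: custom-placement
  namespace: ztp-policies
spec:
  clusterSets:
  - global
  predicates:
  - requiredClusterSelector:
      labelSelector:
        matchLabels:
          upgrade: "true"
  tolerations:
  - key: cluster.open-cluster-management.io/unavailable
    operator: Exists
  - key: cluster.open-cluster-management.io/unreachable
    operator: Exists
----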