// Module included in the following assemblies:
//
// * edge_computing/cnf-talm-for-cluster-upgrades.adoc
:_mod-docs-content-type: PROCEDURE
[id="talo-troubleshooting_{context}"]
= Troubleshooting the {cgu-operator-full}
The {cgu-operator-first} is an {product-title} Operator that remediates {rh-rhacm} policies. When issues occur, use the `oc adm must-gather` command to gather details and logs to help you debug the issues.
For more information about related topics, see the following documentation:
* link:https://access.redhat.com/articles/6218901[Red Hat Advanced Cluster Management for Kubernetes 2.4 Support Matrix]
* link:https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.0/html/troubleshooting/troubleshooting[Red Hat Advanced Cluster Management Troubleshooting]
* The "Troubleshooting Operator issues" section
[id="talo-general-troubleshooting_{context}"]
== General troubleshooting
You can determine the cause of the problem by reviewing the following questions:
* Is the configuration that you are applying supported?
** Are the {rh-rhacm} and the {product-title} versions compatible?
** Are the {cgu-operator} and {rh-rhacm} versions compatible?
* Which of the following components is causing the problem?
** <<talo-troubleshooting-managed-policies_{context}>>
** <<talo-troubleshooting-clusters_{context}>>
** <<talo-troubleshooting-remediation-strategy_{context}>>
** <<talo-troubleshooting-remediation-talo_{context}>>
To ensure that the `ClusterGroupUpgrade` configuration is functional, you can do the following:
. Create the `ClusterGroupUpgrade` CR with the `spec.enable` field set to `false`.
. Wait for the status to be updated and go through the troubleshooting questions.
. If everything looks as expected, set the `spec.enable` field to `true` in the `ClusterGroupUpgrade` CR.
[WARNING]
====
After you set the `spec.enable` field to `true` in the `ClusterGroupUpgrade` CR, the update procedure starts and you cannot edit the CR's `spec` fields anymore.
====
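The following example shows a minimal `ClusterGroupUpgrade` CR created with the `spec.enable` field set to `false`, as described in step 1 of the preceding procedure. The cluster and policy names are illustrative placeholders taken from the examples in this module; substitute your own values.
.Example `ClusterGroupUpgrade` CR with the update disabled
[source,yaml]
----
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: lab-upgrade
  namespace: default
spec:
  clusters:
  - spoke1
  - spoke3
  enable: false
  managedPolicies:
  - policy1-common-cluster-version-policy
  - policy2-common-nto-sub-policy
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240
----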
[id="talo-troubleshooting-modify-cgu_{context}"]
== Cannot modify the ClusterGroupUpgrade CR
Issue:: You cannot edit the `ClusterGroupUpgrade` CR after enabling the update.
Resolution:: Restart the procedure by performing the following steps:
+
. Remove the old `ClusterGroupUpgrade` CR by running the following command:
+
[source,terminal]
----
$ oc delete cgu -n <ClusterGroupUpgradeCR_namespace> <ClusterGroupUpgradeCR_name>
----
+
. Check and fix the existing issues with the managed clusters and policies.
.. Ensure that all the clusters are managed clusters and available.
.. Ensure that all the policies exist and have the `spec.remediationAction` field set to `inform`.
+
. Create a new `ClusterGroupUpgrade` CR with the correct configurations.
+
[source,terminal]
----
$ oc apply -f <ClusterGroupUpgradeCR_YAML>
----
[id="talo-troubleshooting-managed-policies_{context}"]
== Managed policies
[discrete]
=== Checking managed policies on the system
Issue:: You want to check if you have the correct managed policies on the system.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.managedPolicies}'
----
+
.Example output
+
[source,json]
----
["group-du-sno-validator-du-validator-policy", "policy2-common-nto-sub-policy", "policy3-common-ptp-sub-policy"]
----
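+
To confirm that each policy in the list exists on the hub cluster, you can filter the policy list by name. The following command is an example only and uses a policy name from the preceding output:
+
[source,terminal]
----
$ oc get policies --all-namespaces | grep policy2-common-nto-sub-policy
----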
[discrete]
=== Checking remediationAction mode
Issue:: You want to check if the `remediationAction` field is set to `inform` in the `spec` of the managed policies.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get policies --all-namespaces
----
+
.Example output
+
[source,terminal]
----
NAMESPACE   NAME                                     REMEDIATION ACTION   COMPLIANCE STATE   AGE
default     policy1-common-cluster-version-policy    inform               NonCompliant       5d21h
default     policy2-common-nto-sub-policy            inform               Compliant          5d21h
default     policy3-common-ptp-sub-policy            inform               NonCompliant       5d21h
default     policy4-common-sriov-sub-policy          inform               NonCompliant       5d21h
----
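+
To check the `remediationAction` mode of a single policy, you can query the field directly. The following command is an example only and assumes that the policy is in the `default` namespace, as in the preceding output:
+
[source,terminal]
----
$ oc get policy policy3-common-ptp-sub-policy -n default -ojsonpath='{.spec.remediationAction}'
----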
[discrete]
=== Checking policy compliance state
Issue:: You want to check the compliance state of policies.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get policies --all-namespaces
----
+
.Example output
+
[source,terminal]
----
NAMESPACE   NAME                                     REMEDIATION ACTION   COMPLIANCE STATE   AGE
default     policy1-common-cluster-version-policy    inform               NonCompliant       5d21h
default     policy2-common-nto-sub-policy            inform               Compliant          5d21h
default     policy3-common-ptp-sub-policy            inform               NonCompliant       5d21h
default     policy4-common-sriov-sub-policy          inform               NonCompliant       5d21h
----
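+
To check the compliance state of a single policy, you can query its `status.compliant` field directly. The following command is an example only and assumes that the policy is in the `default` namespace, as in the preceding output:
+
[source,terminal]
----
$ oc get policy policy3-common-ptp-sub-policy -n default -ojsonpath='{.status.compliant}'
----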
[id="talo-troubleshooting-clusters_{context}"]
== Clusters
[discrete]
=== Checking if managed clusters are present
Issue:: You want to check if the clusters in the `ClusterGroupUpgrade` CR are managed clusters.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get managedclusters
----
+
.Example output
+
[source,terminal]
----
NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
local-cluster   true           https://api.hub.example.com:6443      True     Unknown     13d
spoke1          true           https://api.spoke1.example.com:6443   True     True        13d
spoke3          true           https://api.spoke3.example.com:6443   True     True        27h
----
. Alternatively, check the {cgu-operator} manager logs:
.. Get the name of the {cgu-operator} manager by running the following command:
+
[source,terminal]
----
$ oc get pod -n openshift-operators
----
+
.Example output
+
[source,terminal]
----
NAME                                                          READY   STATUS    RESTARTS   AGE
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp   2/2     Running   0          45m
----
.. Check the {cgu-operator} manager logs by running the following command:
+
[source,terminal]
----
$ oc logs -n openshift-operators \
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
----
+
.Example output
+
[source,terminal]
----
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} <1>
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
----
<1> The error message shows that the cluster is not a managed cluster.
[discrete]
=== Checking if managed clusters are available
Issue:: You want to check if the managed clusters specified in the `ClusterGroupUpgrade` CR are available.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get managedclusters
----
+
.Example output
+
[source,terminal]
----
NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
local-cluster   true           https://api.hub.testlab.com:6443      True     Unknown     13d
spoke1          true           https://api.spoke1.testlab.com:6443   True     True        13d <1>
spoke3          true           https://api.spoke3.testlab.com:6443   True     True        27h <1>
----
<1> The value of the `AVAILABLE` field is `True` for the managed clusters.
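To check the availability of a single managed cluster, you can query its `ManagedClusterConditionAvailable` condition. The following command is an example only and assumes a managed cluster named `spoke1`, as in the preceding output:
[source,terminal]
----
$ oc get managedcluster spoke1 -ojsonpath='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}'
----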
[discrete]
=== Checking clusterLabelSelector
Issue:: You want to check if the `clusterLabelSelectors` field specified in the `ClusterGroupUpgrade` CR matches at least one of the managed clusters.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get managedcluster --selector=upgrade=true <1>
----
<1> The label for the clusters you want to update is `upgrade:true`.
+
.Example output
+
[source,terminal]
----
NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h
----
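+
If a cluster that you expect to be updated is missing from the output, the selector label might not be set on that cluster. As an example, the following command adds the `upgrade=true` label to a managed cluster named `spoke1`:
+
[source,terminal]
----
$ oc label managedcluster spoke1 upgrade=true
----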
[discrete]
=== Checking if canary clusters are present
Issue:: You want to check if the canary clusters are present in the list of clusters.
+
.Example `ClusterGroupUpgrade` CR
[source,yaml]
----
spec:
  remediationStrategy:
    canaries:
    - spoke3
    maxConcurrency: 2
    timeout: 240
  clusterLabelSelectors:
  - matchLabels:
      upgrade: true
Resolution:: Run the following commands:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'
----
+
.Example output
+
[source,json]
----
["spoke1", "spoke3"]
----
. Check if the canary clusters are present in the list of clusters that match the `clusterLabelSelectors` labels by running the following command:
+
[source,terminal]
----
$ oc get managedcluster --selector=upgrade=true
----
+
.Example output
+
[source,terminal]
----
NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h
----
[NOTE]
====
A cluster can be present in `spec.clusters` and also be matched by the `spec.clusterLabelSelectors` labels.
====
[discrete]
=== Checking the pre-caching status on spoke clusters
. Check the status of pre-caching by running the following command on the spoke cluster:
+
[source,terminal]
----
$ oc get jobs,pods -n openshift-talo-pre-cache
----
[id="talo-troubleshooting-remediation-strategy_{context}"]
== Remediation strategy
[discrete]
=== Checking if remediationStrategy is present in the ClusterGroupUpgrade CR
Issue:: You want to check if the `remediationStrategy` is present in the `ClusterGroupUpgrade` CR.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy}'
----
+
.Example output
+
[source,json]
----
{"maxConcurrency":2, "timeout":240}
----
[discrete]
=== Checking if maxConcurrency is specified in the ClusterGroupUpgrade CR
Issue:: You want to check if the `maxConcurrency` is specified in the `ClusterGroupUpgrade` CR.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy.maxConcurrency}'
----
+
.Example output
+
[source,terminal]
----
2
----
[id="talo-troubleshooting-remediation-talo_{context}"]
== {cgu-operator-full}
[discrete]
=== Checking condition message and status in the ClusterGroupUpgrade CR
Issue:: You want to check the value of the `status.conditions` field in the `ClusterGroupUpgrade` CR.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'
----
+
.Example output
+
[source,json]
----
{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}
----
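+
To extract only the message of a specific condition type, such as the `Validated` condition shown in the preceding output, you can filter with a JSONPath expression. For example:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.status.conditions[?(@.type=="Validated")].message}'
----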
[discrete]
=== Checking if status.remediationPlan was computed
Issue:: You want to check if `status.remediationPlan` is computed.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.status.remediationPlan}'
----
+
.Example output
+
[source,json]
----
[["spoke2", "spoke3"]]
----
[discrete]
=== Errors in the {cgu-operator} manager container
Issue:: You want to check the logs of the manager container of {cgu-operator}.
Resolution:: Run the following command:
+
[source,terminal]
----
$ oc logs -n openshift-operators \
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
----
+
.Example output
+
[source,terminal]
----
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} <1>
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
----
<1> Displays the error.
[discrete]
=== Clusters are not compliant with some policies after a `ClusterGroupUpgrade` CR has completed
Issue:: The policy compliance status that {cgu-operator} uses to decide if remediation is needed has not yet fully updated for all clusters.
This might be because:
+
* The `ClusterGroupUpgrade` CR was run too soon after a policy was created or updated.
* The remediation of a policy affects the compliance of subsequent policies in the `ClusterGroupUpgrade` CR.
Resolution:: Create and apply a new `ClusterGroupUpgrade` CR with the same specification.
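+
For example, assuming that the completed CR is named `lab-upgrade` in the `default` namespace and that its specification is saved in `<ClusterGroupUpgradeCR_YAML>`, you can run the following commands:
+
[source,terminal]
----
$ oc delete cgu lab-upgrade -n default
$ oc apply -f <ClusterGroupUpgradeCR_YAML>
----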
[discrete]
[id="talo-troubleshooting-auto-create-policies_{context}"]
=== Auto-created `ClusterGroupUpgrade` CR in the {ztp} workflow has no managed policies
Issue:: If there are no policies for the managed cluster when the cluster becomes `Ready`, a `ClusterGroupUpgrade` CR with no policies is auto-created.
Upon completion of the `ClusterGroupUpgrade` CR, the managed cluster is labeled as `ztp-done`.
If the `PolicyGenerator` or `PolicyGenTemplate` CRs were not pushed to the Git repository within the required time after `ClusterInstance` resources were pushed, this might result in no policies being available for the target cluster when the cluster became `Ready`.
Resolution:: Verify that the policies you want to apply are available on the hub cluster, then create a `ClusterGroupUpgrade` CR with the required policies.
You can either manually create the `ClusterGroupUpgrade` CR or trigger auto-creation again. To trigger auto-creation of the `ClusterGroupUpgrade` CR, remove the `ztp-done` label from the cluster and delete the empty `ClusterGroupUpgrade` CR that was previously created in the `ztp-install` namespace.
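+
For example, assuming a managed cluster named `spoke1` and an auto-created `ClusterGroupUpgrade` CR that uses the cluster name, you can run the following commands:
+
[source,terminal]
----
$ oc label managedcluster spoke1 ztp-done-
$ oc delete cgu spoke1 -n ztp-install
----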
[discrete]
[id="talo-troubleshooting-pre-cache-failed_{context}"]
=== Pre-caching has failed
Issue:: Pre-caching might fail for one of the following reasons:
* There is not enough free space on the node.
* For a disconnected environment, the pre-cache image has not been properly mirrored.
* There was an issue when creating the pod.
Resolution::
. To check if pre-caching has failed due to insufficient space, check the logs of the pre-caching pod on the node.
.. Find the name of the pod using the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-talo-pre-cache
----
+
.. Check the logs to see if the error is related to insufficient space using the following command:
+
[source,terminal]
----
$ oc logs -n openshift-talo-pre-cache <pod name>
----
+
. If there is no log, check the pod status using the following command:
+
[source,terminal]
----
$ oc describe pod -n openshift-talo-pre-cache <pod name>
----
+
. If the pod does not exist, check the job status to see why it could not create a pod using the following command:
+
[source,terminal]
----
$ oc describe job -n openshift-talo-pre-cache pre-cache
----
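+
. If the logs indicate that the node is running out of space, check the available disk space on the affected node. The following command is an example only and assumes that container images are stored under `/var/lib/containers`:
+
[source,terminal]
----
$ oc debug node/<node_name> -- chroot /host df -h /var/lib/containers
----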
[discrete]
[id="talo-troubleshooting-pre-placement-tolerations_{context}"]
=== Matching policies and `ManagedCluster` CRs before the managed cluster is available
Issue:: You want {rh-rhacm} to match policies and managed clusters before the managed clusters become available.
Resolution:: To ensure that {cgu-operator} correctly applies the {rh-rhacm} policies specified in the `spec.managedPolicies` field of the `ClusterGroupUpgrade` (CGU) CR, {cgu-operator} needs to match these policies to the managed cluster before the managed cluster is available.
The {rh-rhacm} `PolicyGenerator` uses the generated `Placement` CR to do this automatically.
By default, this `Placement` CR includes the necessary tolerations to ensure proper {cgu-operator} behavior.
+
The expected `spec.tolerations` settings in the `Placement` CR are as follows:
+
[source,yaml]
----
# ...
tolerations:
- key: cluster.open-cluster-management.io/unavailable
  operator: Exists
- key: cluster.open-cluster-management.io/unreachable
  operator: Exists
# ...
----
+
If you use a custom `Placement` CR instead of the one generated by the {rh-rhacm} `PolicyGenerator`, include these tolerations in that `Placement` CR.
+
For more information on placements in {rh-rhacm}, see link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/clusters/index#placement-overview[Placement overview].
+
For more information on tolerations in {rh-rhacm}, see link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/latest/html-single/clusters/index#taints-tolerations-managed[Placing managed clusters by using taints and tolerations].
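+
If you create your own `Placement` CR, the following minimal example shows where the tolerations fit. The name, namespace, cluster set, and label selector are placeholders for illustration only:
+
[source,yaml]
----
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: custom-placement
  namespace: ztp-policies
spec:
  clusterSets:
  - global
  predicates:
  - requiredClusterSelector:
      labelSelector:
        matchLabels:
          upgrade: "true"
  tolerations:
  - key: cluster.open-cluster-management.io/unavailable
    operator: Exists
  - key: cluster.open-cluster-management.io/unreachable
    operator: Exists
----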