// Module included in the following assemblies:
//
// * edge_computing/cnf-talm-for-cluster-upgrades.adoc

:_mod-docs-content-type: PROCEDURE
[id="talo-troubleshooting_{context}"]
= Troubleshooting the {cgu-operator-full}

The {cgu-operator-first} is an {product-title} Operator that remediates {rh-rhacm} policies. When issues occur, use the `oc adm must-gather` command to gather details and logs and to help you debug the issues.
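
For example, you can collect the default diagnostic data into a local directory that you can inspect or attach to a support case. The destination directory shown here is only an example value:

[source,terminal]
----
$ oc adm must-gather --dest-dir=/tmp/talm-must-gather
----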

For more information about related topics, see the following documentation:

* link:https://access.redhat.com/articles/6218901[Red Hat Advanced Cluster Management for Kubernetes 2.4 Support Matrix]
* link:https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.0/html/troubleshooting/troubleshooting[Red Hat Advanced Cluster Management Troubleshooting]
* The "Troubleshooting Operator issues" section

[id="talo-general-troubleshooting_{context}"]
== General troubleshooting

You can determine the cause of the problem by reviewing the following questions:

* Is the configuration that you are applying supported?
** Are the {rh-rhacm} and the {product-title} versions compatible?
** Are the {cgu-operator} and {rh-rhacm} versions compatible?
* Which of the following components is causing the problem?
** <<talo-troubleshooting-managed-policies_{context}>>
** <<talo-troubleshooting-clusters_{context}>>
** <<talo-troubleshooting-remediation-strategy_{context}>>
** <<talo-troubleshooting-remediation-talo_{context}>>

To ensure that the `ClusterGroupUpgrade` configuration is functional, do the following:

. Create the `ClusterGroupUpgrade` CR with the `spec.enable` field set to `false`, as shown in the example CR after this list.
. Wait for the status to be updated and go through the troubleshooting questions.
. If everything looks as expected, set the `spec.enable` field to `true` in the `ClusterGroupUpgrade` CR.
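
The following example shows a minimal `ClusterGroupUpgrade` CR with `spec.enable` set to `false`. The CR name, namespace, cluster names, and policy name are the placeholder values used elsewhere in this section, and the API version reflects a common version of the {cgu-operator} CRD; verify it against the CRD installed on your hub cluster:

[source,yaml]
----
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: lab-upgrade
  namespace: default
spec:
  enable: false
  clusters:
  - spoke1
  - spoke3
  managedPolicies:
  - policy2-common-nto-sub-policy
  remediationStrategy:
    maxConcurrency: 2
    timeout: 240
----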

[WARNING]
====
After you set the `spec.enable` field to `true` in the `ClusterGroupUpgrade` CR, the update procedure starts and you can no longer edit the CR's `spec` fields.
====
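
When you are ready to start the update, you can set the `spec.enable` field to `true` by patching the CR. The CR name and namespace in this example are the placeholder values used elsewhere in this section:

[source,terminal]
----
$ oc patch cgu lab-upgrade -n default --type merge --patch '{"spec":{"enable":true}}'
----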

[id="talo-troubleshooting-modify-cgu_{context}"]
== Cannot modify the ClusterGroupUpgrade CR

Issue:: You cannot edit the `ClusterGroupUpgrade` CR after enabling the update.

Resolution:: Restart the procedure by performing the following steps:
+
. Remove the old `ClusterGroupUpgrade` CR by running the following command:
+
[source,terminal]
----
$ oc delete cgu -n <ClusterGroupUpgradeCR_namespace> <ClusterGroupUpgradeCR_name>
----
. Check and fix the existing issues with the managed clusters and policies.
.. Ensure that all the clusters are managed clusters and available.
.. Ensure that all the policies exist and have the `spec.remediationAction` field set to `inform`.
. Create a new `ClusterGroupUpgrade` CR with the correct configuration by running the following command:
+
[source,terminal]
----
$ oc apply -f <ClusterGroupUpgradeCR_YAML>
----

[id="talo-troubleshooting-managed-policies_{context}"]
== Managed policies

[discrete]
=== Checking managed policies on the system

Issue:: You want to check if you have the correct managed policies on the system.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.managedPolicies}'
----
+
.Example output
[source,json]
----
["group-du-sno-validator-du-validator-policy", "policy2-common-nto-sub-policy", "policy3-common-ptp-sub-policy"]
----

[discrete]
=== Checking remediationAction mode

Issue:: You want to check if the `remediationAction` field is set to `inform` in the `spec` of the managed policies.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get policies --all-namespaces
----
+
.Example output
[source,terminal]
----
NAMESPACE   NAME                                     REMEDIATION ACTION   COMPLIANCE STATE   AGE
default     policy1-common-cluster-version-policy    inform               NonCompliant       5d21h
default     policy2-common-nto-sub-policy            inform               Compliant          5d21h
default     policy3-common-ptp-sub-policy            inform               NonCompliant       5d21h
default     policy4-common-sriov-sub-policy          inform               NonCompliant       5d21h
----
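+
To check the `remediationAction` mode of a single policy, you can also query that policy directly. The policy name and namespace are placeholder values:
+
[source,terminal]
----
$ oc get policies <policy_name> -n <policy_namespace> -ojsonpath='{.spec.remediationAction}'
----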

[discrete]
=== Checking policy compliance state

Issue:: You want to check the compliance state of policies.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get policies --all-namespaces
----
+
.Example output
[source,terminal]
----
NAMESPACE   NAME                                     REMEDIATION ACTION   COMPLIANCE STATE   AGE
default     policy1-common-cluster-version-policy    inform               NonCompliant       5d21h
default     policy2-common-nto-sub-policy            inform               Compliant          5d21h
default     policy3-common-ptp-sub-policy            inform               NonCompliant       5d21h
default     policy4-common-sriov-sub-policy          inform               NonCompliant       5d21h
----
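+
To check the compliance state of a single policy, you can also query the policy directly. The policy name and namespace are placeholder values, and this assumes the `status.compliant` field reported by the {rh-rhacm} policy API:
+
[source,terminal]
----
$ oc get policies <policy_name> -n <policy_namespace> -ojsonpath='{.status.compliant}'
----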

[id="talo-troubleshooting-clusters_{context}"]
== Clusters

[discrete]
=== Checking if managed clusters are present

Issue:: You want to check if the clusters in the `ClusterGroupUpgrade` CR are managed clusters.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get managedclusters
----
+
.Example output
[source,terminal]
----
NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
local-cluster   true           https://api.hub.example.com:6443      True     Unknown     13d
spoke1          true           https://api.spoke1.example.com:6443   True     True        13d
spoke3          true           https://api.spoke3.example.com:6443   True     True        27h
----

. Alternatively, check the {cgu-operator} manager logs:
.. Get the name of the {cgu-operator} manager by running the following command:
+
[source,terminal]
----
$ oc get pod -n openshift-operators
----
+
.Example output
[source,terminal]
----
NAME                                                          READY   STATUS    RESTARTS   AGE
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp   2/2     Running   0          45m
----

.. Check the {cgu-operator} manager logs by running the following command:
+
[source,terminal]
----
$ oc logs -n openshift-operators \
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
----
+
.Example output
[source,terminal]
----
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} <1>
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
----
<1> The error message shows that the cluster is not a managed cluster.

[discrete]
=== Checking if managed clusters are available

Issue:: You want to check if the managed clusters specified in the `ClusterGroupUpgrade` CR are available.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get managedclusters
----
+
.Example output
[source,terminal]
----
NAME            HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
local-cluster   true           https://api.hub.testlab.com:6443      True     Unknown     13d
spoke1          true           https://api.spoke1.testlab.com:6443   True     True        13d <1>
spoke3          true           https://api.spoke3.testlab.com:6443   True     True        27h <1>
----
<1> The value of the `AVAILABLE` field is `True` for the managed clusters.

[discrete]
=== Checking clusterLabelSelectors

Issue:: You want to check if the `clusterLabelSelectors` field specified in the `ClusterGroupUpgrade` CR matches at least one of the managed clusters.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get managedcluster --selector=upgrade=true <1>
----
<1> The label for the clusters you want to update is `upgrade:true`.
+
.Example output
[source,terminal]
----
NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h
----
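+
To confirm which label selectors the `ClusterGroupUpgrade` CR itself specifies, you can also query the CR directly. This example assumes the `lab-upgrade` CR name and the `upgrade: "true"` label used elsewhere in this section:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.clusterLabelSelectors}'
----
+
.Example output
[source,json]
----
[{"matchLabels":{"upgrade":"true"}}]
----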

[discrete]
=== Checking if canary clusters are present

Issue:: You want to check if the canary clusters are present in the list of clusters.
+
.Example `ClusterGroupUpgrade` CR
[source,yaml]
----
spec:
  remediationStrategy:
    canaries:
    - spoke3
    maxConcurrency: 2
    timeout: 240
  clusterLabelSelectors:
  - matchLabels:
      upgrade: "true"
----

Resolution:: Run the following commands:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.clusters}'
----
+
.Example output
[source,json]
----
["spoke1", "spoke3"]
----

. Check if the canary clusters are present in the list of clusters that match the `clusterLabelSelectors` labels by running the following command:
+
[source,terminal]
----
$ oc get managedcluster --selector=upgrade=true
----
+
.Example output
[source,terminal]
----
NAME     HUB ACCEPTED   MANAGED CLUSTER URLS                  JOINED   AVAILABLE   AGE
spoke1   true           https://api.spoke1.testlab.com:6443   True     True        13d
spoke3   true           https://api.spoke3.testlab.com:6443   True     True        27h
----

[NOTE]
====
A cluster can be present in `spec.clusters` and also be matched by the `spec.clusterLabelSelectors` labels.
====

[discrete]
=== Checking the pre-caching status on spoke clusters

. Check the status of pre-caching by running the following command on the spoke cluster:
+
[source,terminal]
----
$ oc get jobs,pods -n openshift-talo-pre-cache
----
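+
To check whether the pre-cache job completed successfully, you can also inspect the job status directly. This assumes the `pre-cache` job name used later in this section:
+
[source,terminal]
----
$ oc get job pre-cache -n openshift-talo-pre-cache -ojsonpath='{.status.succeeded}'
----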

[id="talo-troubleshooting-remediation-strategy_{context}"]
== Remediation strategy

[discrete]
=== Checking if remediationStrategy is present in the ClusterGroupUpgrade CR

Issue:: You want to check if the `remediationStrategy` is present in the `ClusterGroupUpgrade` CR.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy}'
----
+
.Example output
[source,json]
----
{"maxConcurrency":2, "timeout":240}
----

[discrete]
=== Checking if maxConcurrency is specified in the ClusterGroupUpgrade CR

Issue:: You want to check if the `maxConcurrency` is specified in the `ClusterGroupUpgrade` CR.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.spec.remediationStrategy.maxConcurrency}'
----
+
.Example output
[source,terminal]
----
2
----

[id="talo-troubleshooting-remediation-talo_{context}"]
== {cgu-operator-full}

[discrete]
=== Checking condition message and status in the ClusterGroupUpgrade CR

Issue:: You want to check the value of the `status.conditions` field in the `ClusterGroupUpgrade` CR.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.status.conditions}'
----
+
.Example output
[source,json]
----
{"lastTransitionTime":"2022-02-17T22:25:28Z", "message":"Missing managed policies:[policyList]", "reason":"NotAllManagedPoliciesExist", "status":"False", "type":"Validated"}
----

[discrete]
=== Checking if status.remediationPlan was computed

Issue:: You want to check if `status.remediationPlan` is computed.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc get cgu lab-upgrade -ojsonpath='{.status.remediationPlan}'
----
+
.Example output
[source,json]
----
[["spoke2", "spoke3"]]
----

[discrete]
=== Errors in the {cgu-operator} manager container

Issue:: You want to check the logs of the manager container of {cgu-operator}.

Resolution:: Run the following command:
+
[source,terminal]
----
$ oc logs -n openshift-operators \
cluster-group-upgrades-controller-manager-75bcc7484d-8k8xp -c manager
----
+
.Example output
[source,terminal]
----
ERROR controller-runtime.manager.controller.clustergroupupgrade Reconciler error {"reconciler group": "ran.openshift.io", "reconciler kind": "ClusterGroupUpgrade", "name": "lab-upgrade", "namespace": "default", "error": "Cluster spoke5555 is not a ManagedCluster"} <1>
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
----
<1> Displays the error.

[discrete]
=== Clusters are not compliant with some policies after a `ClusterGroupUpgrade` CR has completed

Issue:: The policy compliance status that {cgu-operator} uses to decide if remediation is needed has not yet fully updated for all clusters.
This might be because:
* The `ClusterGroupUpgrade` CR was applied too soon after a policy was created or updated.
* The remediation of a policy affects the compliance of subsequent policies in the `ClusterGroupUpgrade` CR.

Resolution:: Create and apply a new `ClusterGroupUpgrade` CR with the same specification.
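+
For example, assuming the `lab-upgrade` CR name and `default` namespace used elsewhere in this section, you can delete the completed CR and re-apply the same definition:
+
[source,terminal]
----
$ oc delete cgu lab-upgrade -n default
$ oc apply -f <ClusterGroupUpgradeCR_YAML>
----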

[discrete]
[id="talo-troubleshooting-auto-create-policies_{context}"]
=== Auto-created `ClusterGroupUpgrade` CR in the {ztp} workflow has no managed policies

Issue:: If there are no policies for the managed cluster when the cluster becomes `Ready`, a `ClusterGroupUpgrade` CR with no policies is auto-created.
Upon completion of the `ClusterGroupUpgrade` CR, the managed cluster is labeled as `ztp-done`.
If the `PolicyGenerator` or `PolicyGenTemplate` CRs were not pushed to the Git repository within the required time after `ClusterInstance` resources were pushed, this might result in no policies being available for the target cluster when the cluster became `Ready`.

Resolution:: Verify that the policies you want to apply are available on the hub cluster, then create a `ClusterGroupUpgrade` CR with the required policies.

You can either manually create the `ClusterGroupUpgrade` CR or trigger auto-creation again. To trigger auto-creation of the `ClusterGroupUpgrade` CR, remove the `ztp-done` label from the cluster and delete the empty `ClusterGroupUpgrade` CR that was previously created in the `ztp-install` namespace.
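
For example, assuming that the auto-created `ClusterGroupUpgrade` CR is named after the managed cluster, you can remove the label and delete the empty CR by running the following commands:

[source,terminal]
----
$ oc label managedcluster <cluster_name> ztp-done-
$ oc delete cgu <cluster_name> -n ztp-install
----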

[discrete]
[id="talo-troubleshooting-pre-cache-failed_{context}"]
=== Pre-caching has failed

Issue:: Pre-caching might fail for one of the following reasons:
* There is not enough free space on the node.
* For a disconnected environment, the pre-cache image has not been properly mirrored.
* There was an issue when creating the pod.

Resolution::
. To check if pre-caching has failed due to insufficient space, check the log of the pre-caching pod on the node.
.. Find the name of the pod by running the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-talo-pre-cache
----
.. Check the logs to see if the error is related to insufficient space by running the following command:
+
[source,terminal]
----
$ oc logs -n openshift-talo-pre-cache <pod_name>
----
. If there is no log, check the pod status by running the following command:
+
[source,terminal]
----
$ oc describe pod -n openshift-talo-pre-cache <pod_name>
----
. If the pod does not exist, check the job status to see why it could not create a pod by running the following command:
+
[source,terminal]
----
$ oc describe job -n openshift-talo-pre-cache pre-cache
----

[discrete]
[id="talo-troubleshooting-pre-placement-tolerations_{context}"]
=== Matching policies and `ManagedCluster` CRs before the managed cluster is available

Issue:: You want {rh-rhacm} to match policies and managed clusters before the managed clusters become available.

Resolution:: To ensure that {cgu-operator} correctly applies the {rh-rhacm} policies specified in the `spec.managedPolicies` field of the `ClusterGroupUpgrade` (CGU) CR, {cgu-operator} needs to match these policies to the managed cluster before the managed cluster is available.
The {rh-rhacm} `PolicyGenerator` uses the generated `Placement` CR to do this automatically.
By default, this `Placement` CR includes the necessary tolerations to ensure proper {cgu-operator} behavior.
+
The expected `spec.tolerations` settings in the `Placement` CR are as follows:
+
[source,yaml]
----
#…
tolerations:
  - key: cluster.open-cluster-management.io/unavailable
    operator: Exists
  - key: cluster.open-cluster-management.io/unreachable
    operator: Exists
#…
----
+
If you use a custom `Placement` CR instead of the one generated by the {rh-rhacm} `PolicyGenerator`, include these tolerations in that `Placement` CR.
+
For more information on placements in {rh-rhacm}, see link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/clusters/index#placement-overview[Placement overview].
+
For more information on tolerations in {rh-rhacm}, see link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/latest/html-single/clusters/index#taints-tolerations-managed[Placing managed clusters by using taints and tolerations].