mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-05 12:46:18 +01:00

TELCODOCS-502: Configuring worker machine health check module updated & Configuring control-plane machine health check module added

TELCODOCS-502: Configuring control-plane machine health check module removed / Control plane fencing module added

TELCODOCS-502: QE comments added

TELCODOCS-502: SNR Control plane fencing added

TELCODOCS-502: Dev feedback added

TELCODOCS-502: QE/Dev feedback added

TELCODOCS-502: QE feedback updated

TELCODOCS-502: More Dev/QE feedback included

TELCODOCS-502: More Dev/QE feedback added

TELCODOCS-502: Peer review feedback applied
This commit is contained in:
Padraig O'Grady
2022-10-18 21:42:58 +01:00
committed by openshift-cherrypick-robot
parent f5fabf9a9a
commit 920912ff9c
7 changed files with 150 additions and 15 deletions

View File

@@ -0,0 +1,88 @@
// Module included in the following assemblies:
//
// *nodes/nodes/eco-poison-pill-operator.adoc
:_content-type: PROCEDURE
[id="configuring-control-plane-machine-health-check-with-self-node-remediation-operator_{context}"]
= Configuring control-plane machine health checks to use the Self Node Remediation Operator
Use the following procedure to configure the control-plane machine health checks to use the Self Node Remediation Operator as a remediation provider.
.Prerequisites
* Install the OpenShift CLI (`oc`).
* Log in as a user with `cluster-admin` privileges.
.Procedure
. Create a `SelfNodeRemediationTemplate` CR:
.. Define the `SelfNodeRemediationTemplate` CR:
+
[source,yaml]
----
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationTemplate
metadata:
  namespace: openshift-machine-api
  name: selfnoderemediationtemplate-sample
spec:
  template:
    spec:
      remediationStrategy: ResourceDeletion <1>
----
<1> Specifies the remediation strategy. The default strategy is `ResourceDeletion`.
.. To create the `SelfNodeRemediationTemplate` CR, run the following command:
+
[source,terminal]
----
$ oc create -f <snrt-name>.yaml
----
. Create or update the `MachineHealthCheck` CR to point to the `SelfNodeRemediationTemplate` CR:
.. Define or update the `MachineHealthCheck` CR:
+
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: machine-health-check
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: "control-plane"
      machine.openshift.io/cluster-api-machine-type: "control-plane"
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: "40%"
  nodeStartupTimeout: "10m"
  remediationTemplate: <1>
    kind: SelfNodeRemediationTemplate
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    name: selfnoderemediationtemplate-sample
----
<1> Specifies the details for the remediation template.
+
.. To create a `MachineHealthCheck` CR, run the following command:
+
[source,terminal]
----
$ oc create -f <mhc-name>.yaml
----
.. To update a `MachineHealthCheck` CR, run the following command:
+
[source,terminal]
----
$ oc apply -f <mhc-name>.yaml
----
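
The `maxUnhealthy: "40%"` value in the `MachineHealthCheck` CR above limits how many machines can be unhealthy before remediation is skipped. The following Python sketch is illustrative only and is not part of the Operator. It assumes that a percentage is rounded down to a whole number of machines and that remediation proceeds only while the unhealthy count does not exceed that number. The function names are hypothetical.

```python
def max_unhealthy_machines(total_machines: int, max_unhealthy: str) -> int:
    """Convert a maxUnhealthy value such as "40%" or "2" to a machine count.

    Assumption: percentages are rounded down to a whole machine.
    """
    if max_unhealthy.endswith("%"):
        percent = int(max_unhealthy[:-1])
        # Integer division rounds down, for example 3 * 40 // 100 == 1.
        return total_machines * percent // 100
    return int(max_unhealthy)


def remediation_allowed(total_machines: int, unhealthy: int, max_unhealthy: str) -> bool:
    """Return True when the unhealthy count is within the maxUnhealthy limit."""
    return unhealthy <= max_unhealthy_machines(total_machines, max_unhealthy)


# Three control plane machines with maxUnhealthy "40%":
# one unhealthy machine is within the limit, two are not.
print(remediation_allowed(3, 1, "40%"))
print(remediation_allowed(3, 2, "40%"))
```

Under these assumptions, a three-node control plane with `maxUnhealthy: "40%"` tolerates one unhealthy machine before remediation is skipped.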

View File

@@ -6,7 +6,7 @@
[id="configuring-machine-health-check-with-self-node-remediation-operator_{context}"]
= Configuring machine health checks to use the Self Node Remediation Operator
Use the following procedure to configure the machine health checks to use the Self Node Remediation Operator as a remediation provider.
Use the following procedure to configure the worker or control-plane machine health checks to use the Self Node Remediation Operator as a remediation provider.
.Prerequisites
@@ -37,7 +37,7 @@ spec:
+
[source,terminal]
----
$ oc create -f <snr-name>.yaml
$ oc create -f <snrt-name>.yaml
----
. Create or update the `MachineHealthCheck` CR to point to the `SelfNodeRemediationTemplate` CR:
@@ -53,7 +53,7 @@ metadata:
namespace: openshift-machine-api
spec:
selector:
matchLabels:
matchLabels: <1>
machine.openshift.io/cluster-api-machine-role: "worker"
machine.openshift.io/cluster-api-machine-type: "worker"
unhealthyConditions:
@@ -65,26 +65,25 @@ spec:
status: "Unknown"
maxUnhealthy: "40%"
nodeStartupTimeout: "10m"
remediationTemplate: <1>
remediationTemplate: <2>
kind: SelfNodeRemediationTemplate
apiVersion: self-node-remediation.medik8s.io/v1alpha1
name: selfnoderemediationtemplate-sample
----
<1> Specifies the details for the remediation template.
<1> Selects whether the machine health check is for `worker` or `control-plane` nodes. The label can also be user-defined.
<2> Specifies the details for the remediation template.
+
.. To create a `MachineHealthCheck` CR, run the following command:
+
[source,terminal]
----
$ oc create -f <file-name>.yaml
$ oc create -f <mhc-name>.yaml
----
.. To update a `MachineHealthCheck` CR, run the following command:
+
[source,terminal]
----
$ oc apply -f <file-name>.yaml
$ oc apply -f <mhc-name>.yaml
----

View File

@@ -0,0 +1,16 @@
// Module included in the following assemblies:
//
// * nodes/nodes/eco-node-health-check-operator.adoc
:_content-type: CONCEPT
[id="control-plane-fencing-node-health-check-operator_{context}"]
= Control plane fencing
In earlier releases, you could enable Self Node Remediation and Node Health Check on worker nodes. In the event of node failure, you can now also follow remediation strategies on control plane nodes.
Do not use the same `NodeHealthCheck` CR for worker nodes and control plane nodes. Grouping worker nodes and control plane nodes together can result in incorrect evaluation of the minimum healthy node count, and cause unexpected or missing remediations. This is because of the way the Node Health Check Operator handles control plane nodes. You should group the control plane nodes in their own group and the worker nodes in their own group. If required, you can also create multiple groups of worker nodes.
Considerations for remediation strategies:
* Avoid multiple Node Health Check configurations that overlap the same nodes, because they can result in unexpected behavior. This recommendation applies to both worker and control plane nodes.
* The Node Health Check Operator remediates a maximum of one control plane node at a time. This limit is hardcoded. Multiple control plane nodes should not be remediated at the same time.

View File

@@ -89,12 +89,13 @@ The Self Node Remediation Operator also creates the `SelfNodeRemediationTemplate
`ResourceDeletion`:: This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. `ResourceDeletion` is the default remediation strategy.
`NodeDeletion`:: This remediation strategy removes the node object.
`NodeDeletion`:: This remediation strategy is deprecated and will be removed in a future release. In the current release, the `ResourceDeletion` strategy is used even if the `NodeDeletion` strategy is selected.
The Self Node Remediation Operator creates the following `SelfNodeRemediationTemplate` CRs for each strategy:
The Self Node Remediation Operator creates the following `SelfNodeRemediationTemplate` CR for the strategy:
* `self-node-remediation-resource-deletion-template`, which the `ResourceDeletion` remediation strategy uses
* `self-node-remediation-node-deletion-template`, which the `NodeDeletion` remediation strategy uses
//* `self-node-remediation-node-deletion-template`, which the `NodeDeletion` remediation strategy uses
The `SelfNodeRemediationTemplate` CR resembles the following YAML file:
@@ -111,5 +112,6 @@ spec:
spec:
remediationStrategy: <remediation_strategy> <2>
----
<1> Specifies the type of remediation template based on the remediation strategy. Replace `<remediation_object>` with either `resource` or `node`, for example, `self-node-remediation-resource-deletion-template`.
<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`.
<1> Specifies the type of remediation template based on the remediation strategy. Replace `<remediation_object>` with either `resource` or `node`; for example, `self-node-remediation-resource-deletion-template`.
//<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`.
<2> Specifies the remediation strategy. The remediation strategy is `ResourceDeletion`.

View File

@@ -0,0 +1,26 @@
// Module included in the following assemblies:
//
// * nodes/nodes/eco-self-node-remediation-operator.adoc
:_content-type: CONCEPT
[id="control-plane-fencing-self-node-remediation-operator_{context}"]
= Control plane fencing
In earlier releases, you could enable Self Node Remediation and Node Health Check on worker nodes. In the event of node failure, you can now also follow remediation strategies on control plane nodes.
Self Node Remediation occurs in two primary scenarios.
* API Server Connectivity
** In this scenario, the control plane node to be remediated is not isolated. It can be directly connected to the API Server, or it can be indirectly connected to the API Server through worker nodes or control plane nodes that are directly connected to the API Server.
** When there is API Server Connectivity, the control plane node is remediated only if the Node Health Check Operator has created a `SelfNodeRemediation` custom resource (CR) for the node.
* No API Server Connectivity
** In this scenario, the control plane node to be remediated is isolated from the API Server. The node cannot connect directly or indirectly to the API Server.
** When there is no API Server Connectivity, the control plane node is remediated according to the following steps:
*** Check the status of the control plane node with the majority of the peer worker nodes. If the status is unhealthy or unknown, the node is analyzed further, even if the control plane node can communicate with the peer worker nodes.
**** Self-diagnose the status of the control plane node.
***** If self-diagnostics pass, no action is taken.
***** If self-diagnostics fail, the node is fenced and remediated.
*** If the node cannot communicate with most of its worker peers, check the connectivity of the control plane node with other control plane nodes. If the node can communicate with any other control plane peer, no action is taken. Otherwise, the node is fenced and remediated.
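
The two scenarios above can be read as a small decision tree. The following Python sketch is a reading aid only, not the Operator's implementation. The `NodeView` class, its field names, and the `should_remediate` function are hypothetical. The case where the worker peers report the node as healthy is not described above, so the sketch assumes that no action is taken.

```python
from dataclasses import dataclass


@dataclass
class NodeView:
    """Hypothetical snapshot of what a control plane node observes."""
    api_reachable: bool                   # direct or indirect API Server connectivity
    snr_cr_exists: bool                   # a SelfNodeRemediation CR exists for the node
    reached_worker_majority: bool         # node reached most of its worker peers
    peers_report_unhealthy: bool          # peers report status unhealthy or unknown
    self_diagnostics_passed: bool         # result of the node's self-diagnosis
    reached_any_control_plane_peer: bool  # node reached at least one control plane peer


def should_remediate(view: NodeView) -> bool:
    """Return True when the node would be fenced and remediated."""
    if view.api_reachable:
        # With API Server connectivity, remediate only if the CR exists.
        return view.snr_cr_exists
    if view.reached_worker_majority:
        if not view.peers_report_unhealthy:
            # Assumption: a node reported as healthy needs no action.
            return False
        # Analyze further: remediate only if self-diagnostics fail.
        return not view.self_diagnostics_passed
    # Most worker peers are unreachable: fall back to control plane peers.
    return not view.reached_any_control_plane_peer


# A fully isolated node: no API Server, no worker majority, no control plane peer.
isolated = NodeView(False, False, False, False, True, False)
print(should_remediate(isolated))
```

In this sketch, a node that cannot reach the API Server, most worker peers, or any control plane peer is the case that ends in fencing, which matches the last step above.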

View File

@@ -15,6 +15,8 @@ xref:../../nodes/nodes/eco-self-node-remediation-operator.adoc#self-node-remedia
include::modules/eco-node-health-check-operator-about.adoc[leveloffset=+1]
include::modules/eco-node-health-check-operator-control-plane-fencing.adoc[leveloffset=+1]
include::modules/eco-node-health-check-operator-installation-web-console.adoc[leveloffset=+1]
include::modules/eco-node-health-check-operator-installation-cli.adoc[leveloffset=+1]

View File

@@ -15,6 +15,8 @@ include::modules/eco-self-node-remediation-about-watchdog.adoc[leveloffset=+2]
.Additional resources
xref:../../virt/virtual_machines/advanced_vm_management/virt-configuring-a-watchdog.adoc#virt-configuring-a-watchdog[Configuring a watchdog]
include::modules/eco-self-node-remediation-operator-control-plane-fencing.adoc[leveloffset=+1]
include::modules/eco-self-node-remediation-operator-installation-web-console.adoc[leveloffset=+1]
include::modules/eco-self-node-remediation-operator-installation-cli.adoc[leveloffset=+1]
@@ -32,4 +34,4 @@ To collect debugging information about the Self Node Remediation Operator, use t
== Additional resources
* The Self Node Remediation Operator is supported in a restricted network environment. For more information, see xref:../../operators/admin/olm-restricted-networks.adoc#olm-restricted-networks[Using Operator Lifecycle Manager on restricted networks].
* xref:../../operators/admin/olm-deleting-operators-from-cluster.adoc#olm-deleting-operators-from-a-cluster[Deleting Operators from a cluster]
* xref:../../operators/admin/olm-deleting-operators-from-cluster.adoc#olm-deleting-operators-from-a-cluster[Deleting Operators from a cluster]