From 6ab9dc4033c24f665aefecd020ec832372b76f52 Mon Sep 17 00:00:00 2001 From: Padraig O'Grady Date: Wed, 21 Dec 2022 11:39:11 +0000 Subject: [PATCH] TELCODOCS-843: Remediation, Fencing, and Maintentance concept details added [enterprise-4.12] TELCODOCS-843: Remediation, Fencing, and Maintentance concept details added --- _topic_maps/_topic_map.yml | 19 ++-- ...-self-node-remediation-operator-about.adoc | 90 ------------------ ...node-remediation-operator-configuring.adoc | 95 +++++++++++++++++++ ...iation-operator-control-plane-fencing.adoc | 3 +- ...ion-operator-installation-web-console.adoc | 7 +- modules/machine-health-checks-about.adoc | 2 - ...about-remediation-fencing-maintenance.adoc | 39 ++++++++ .../ecosystems/eco-machine-health-checks.adoc | 13 +++ .../eco-node-health-check-operator.adoc | 12 +-- .../eco-node-maintenance-operator.adoc | 12 +-- .../eco-self-node-remediation-operator.adoc | 16 ++-- nodes/nodes/ecosystems/images | 1 + release_notes/ocp-4-12-release-notes.adoc | 8 +- virt/install/preparing-cluster-for-virt.adoc | 2 +- .../virt-about-node-maintenance.adoc | 6 +- virt/upgrading-virt.adoc | 2 +- 16 files changed, 198 insertions(+), 129 deletions(-) create mode 100644 modules/eco-self-node-remediation-operator-configuring.adoc create mode 100644 nodes/nodes/ecosystems/eco-about-remediation-fencing-maintenance.adoc create mode 100644 nodes/nodes/ecosystems/eco-machine-health-checks.adoc rename nodes/nodes/{ => ecosystems}/eco-node-health-check-operator.adoc (51%) rename nodes/nodes/{ => ecosystems}/eco-node-maintenance-operator.adoc (78%) rename nodes/nodes/{ => ecosystems}/eco-self-node-remediation-operator.adoc (62%) create mode 120000 nodes/nodes/ecosystems/images diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml index 18426963f4..a4819e9964 100644 --- a/_topic_maps/_topic_map.yml +++ b/_topic_maps/_topic_map.yml @@ -2141,12 +2141,19 @@ Topics: File: nodes-nodes-managing-max-pods - Name: Using the Node Tuning Operator File: 
nodes-node-tuning-operator - - Name: Remediating nodes with the Self Node Remediation Operator - File: eco-self-node-remediation-operator - - Name: Deploying node health checks by using the Node Health Check Operator - File: eco-node-health-check-operator - - Name: Using the Node Maintenance Operator to place nodes in maintenance mode - File: eco-node-maintenance-operator + - Name: Remediation, fencing, and maintenance + Dir: ecosystems + Topics: + - Name: About node remediation, fencing, and maintenance + File: eco-about-remediation-fencing-maintenance + - Name: Using Self Node Remediation + File: eco-self-node-remediation-operator + - Name: Remediating nodes with Machine Health Checks + File: eco-machine-health-checks + - Name: Remediating nodes with Node Health Checks + File: eco-node-health-check-operator + - Name: Placing nodes in maintenance mode with Node Maintenance Operator + File: eco-node-maintenance-operator - Name: Understanding node rebooting File: nodes-nodes-rebooting - Name: Freeing node resources using garbage collection diff --git a/modules/eco-self-node-remediation-operator-about.adoc b/modules/eco-self-node-remediation-operator-about.adoc index a3f878de7d..cf4ffa94d2 100644 --- a/modules/eco-self-node-remediation-operator-about.adoc +++ b/modules/eco-self-node-remediation-operator-about.adoc @@ -25,93 +25,3 @@ status: <1> Displays the last error that occurred during remediation. When remediation succeeds or if no errors occur, the field is left empty. The Self Node Remediation Operator minimizes downtime for stateful applications and restores compute capacity if transient failures occur. You can use this Operator regardless of the management interface, such as IPMI or an API to provision a node, and regardless of the cluster installation type, such as installer-provisioned infrastructure or user-provisioned infrastructure. 
- -[id="understanding-self-node-remediation-operator-config_{context}"] -== Understanding the Self Node Remediation Operator configuration - -The Self Node Remediation Operator creates the `SelfNodeRemediationConfig` CR with the name `self-node-remediation-config`. The CR is created in the namespace of the Self Node Remediation Operator. - -A change in the `SelfNodeRemediationConfig` CR re-creates the Self Node Remediation daemon set. - -The `SelfNodeRemediationConfig` CR resembles the following YAML file: - -[source,yaml] ----- -apiVersion: self-node-remediation.medik8s.io/v1alpha1 -kind: SelfNodeRemediationConfig -metadata: - name: self-node-remediation-config - namespace: openshift-operators -spec: - safeTimeToAssumeNodeRebootedSeconds: 180 <1> - watchdogFilePath: /dev/watchdog <2> - isSoftwareRebootEnabled: true <3> - apiServerTimeout: 15s <4> - apiCheckInterval: 5s <5> - maxApiErrorThreshold: 3 <6> - peerApiServerTimeout: 5s <7> - peerDialTimeout: 5s <8> - peerRequestTimeout: 5s <9> - peerUpdateInterval: 15m <10> ----- - -<1> Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value. -<2> Specify the file path of the watchdog device in the nodes. If you enter an incorrect path to the watchdog device, the Self Node Remediation Operator automatically detects the softdog device path. -+ -If a watchdog device is unavailable, the `SelfNodeRemediationConfig` CR uses a software reboot. -<3> Specify if you want to enable software reboot of the unhealthy nodes. By default, the value of `isSoftwareRebootEnabled` is set to `true`. To disable the software reboot, set the parameter value to `false`. -<4> Specify the timeout duration to check connectivity with each API server. 
When this duration elapses, the Operator starts remediation. The timeout duration must be more than or equal to 10 milliseconds. -<5> Specify the frequency to check connectivity with each API server. The timeout duration must be more than or equal to 1 second. -<6> Specify a threshold value. After reaching this threshold, the node starts contacting its peers. The threshold value must be more than or equal to 1 second. -<7> Specify the duration of the timeout for the peer to connect the API server. The timeout duration must be more than or equal to 10 milliseconds. -<8> Specify the duration of the timeout for establishing connection with the peer. The timeout duration must be more than or equal to 10 milliseconds. -<9> Specify the duration of the timeout to get a response from the peer. The timeout duration must be more than or equal to 10 milliseconds. -<10> Specify the frequency to update peer information, such as IP address. The timeout duration must be more than or equal to 10 seconds. - -[NOTE] -==== -You can edit the `self-node-remediation-config` CR that is created by the Self Node Remediation Operator. However, when you try to create a new CR for the Self Node Remediation Operator, the following message is displayed in the logs: - -[source,text] ----- -controllers.SelfNodeRemediationConfig -ignoring selfnoderemediationconfig CRs that are not named 'self-node-remediation-config' -or not in the namespace of the operator: -'openshift-operators' {"selfnoderemediationconfig": -"openshift-operators/selfnoderemediationconfig-copy"} ----- -==== - -[id="understanding-self-node-remediation-remediation-template-config_{context}"] -== Understanding the Self Node Remediation Template configuration - -The Self Node Remediation Operator also creates the `SelfNodeRemediationTemplate` Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes. 
The following remediation strategies are available: - -`ResourceDeletion`:: This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. `ResourceDeletion` is the default remediation strategy. - -`NodeDeletion`:: This remediation strategy is deprecated and will be removed in a future release. In the current release, the `ResourceDeletion` strategy is used even if the `NodeDeletion` strategy is selected. - - -The Self Node Remediation Operator creates the following `SelfNodeRemediationTemplate` CR for the strategy: - -* `self-node-remediation-resource-deletion-template`, which the `ResourceDeletion` remediation strategy uses -//* `self-node-remediation-node-deletion-template`, which the `NodeDeletion` remediation strategy uses - -The `SelfNodeRemediationTemplate` CR resembles the following YAML file: - -[source,yaml] ----- -apiVersion: self-node-remediation.medik8s.io/v1alpha1 -kind: SelfNodeRemediationTemplate -metadata: - creationTimestamp: "2022-03-02T08:02:40Z" - name: self-node-remediation--deletion-template <1> - namespace: openshift-operators -spec: - template: - spec: - remediationStrategy: <2> ----- -<1> Specifies the type of remediation template based on the remediation strategy. Replace `` with either `resource` or `node`; for example, `self-node-remediation-resource-deletion-template`. -//<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`. -<2> Specifies the remediation strategy. The remediation strategy is `ResourceDeletion`. 
diff --git a/modules/eco-self-node-remediation-operator-configuring.adoc b/modules/eco-self-node-remediation-operator-configuring.adoc new file mode 100644 index 0000000000..2345a3bdd5 --- /dev/null +++ b/modules/eco-self-node-remediation-operator-configuring.adoc @@ -0,0 +1,95 @@ +// Module included in the following assemblies: +// +// * nodes/nodes/eco-self-node-remediation-operator.adoc + +:_content-type: CONCEPT +[id="configuring-self-node-remediation-operator_{context}"] += Configuring the Self Node Remediation Operator + +The Self Node Remediation Operator creates the `SelfNodeRemediationConfig` CR and the `SelfNodeRemediationTemplate` Custom Resource Definition (CRD). + +[id="understanding-self-node-remediation-operator-config_{context}"] +== Understanding the Self Node Remediation Operator configuration + +The Self Node Remediation Operator creates the `SelfNodeRemediationConfig` CR with the name `self-node-remediation-config`. The CR is created in the namespace of the Self Node Remediation Operator. + +A change in the `SelfNodeRemediationConfig` CR re-creates the Self Node Remediation daemon set. + +The `SelfNodeRemediationConfig` CR resembles the following YAML file: + +[source,yaml] +---- +apiVersion: self-node-remediation.medik8s.io/v1alpha1 +kind: SelfNodeRemediationConfig +metadata: + name: self-node-remediation-config + namespace: openshift-operators +spec: + safeTimeToAssumeNodeRebootedSeconds: 180 <1> + watchdogFilePath: /dev/watchdog <2> + isSoftwareRebootEnabled: true <3> + apiServerTimeout: 15s <4> + apiCheckInterval: 5s <5> + maxApiErrorThreshold: 3 <6> + peerApiServerTimeout: 5s <7> + peerDialTimeout: 5s <8> + peerRequestTimeout: 5s <9> + peerUpdateInterval: 15m <10> +---- + +<1> Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. 
However, if different nodes have different watchdog timeouts, you must change this value to a higher value. +<2> Specify the file path of the watchdog device in the nodes. If you enter an incorrect path to the watchdog device, the Self Node Remediation Operator automatically detects the softdog device path. ++ +If a watchdog device is unavailable, the `SelfNodeRemediationConfig` CR uses a software reboot. +<3> Specify whether to enable software reboot of the unhealthy nodes. By default, the value of `isSoftwareRebootEnabled` is set to `true`. To disable the software reboot, set the parameter value to `false`. +<4> Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation. The timeout duration must be greater than or equal to 10 milliseconds. +<5> Specify how often to check connectivity with each API server. The interval must be greater than or equal to 1 second. +<6> Specify a threshold value. After reaching this threshold, the node starts contacting its peers. The threshold value must be greater than or equal to 1. +<7> Specify the duration of the timeout for the peer to connect to the API server. The timeout duration must be greater than or equal to 10 milliseconds. +<8> Specify the duration of the timeout for establishing a connection with the peer. The timeout duration must be greater than or equal to 10 milliseconds. +<9> Specify the duration of the timeout to get a response from the peer. The timeout duration must be greater than or equal to 10 milliseconds. +<10> Specify how often to update peer information, such as IP addresses. The interval must be greater than or equal to 10 seconds. + +[NOTE] +==== +You can edit the `self-node-remediation-config` CR that is created by the Self Node Remediation Operator. 
However, if you try to create a new CR for the Self Node Remediation Operator, the following message is displayed in the logs: + +[source,text] +---- +controllers.SelfNodeRemediationConfig +ignoring selfnoderemediationconfig CRs that are not named 'self-node-remediation-config' +or not in the namespace of the operator: +'openshift-operators' {"selfnoderemediationconfig": +"openshift-operators/selfnoderemediationconfig-copy"} +---- +==== + +[id="understanding-self-node-remediation-remediation-template-config_{context}"] +== Understanding the Self Node Remediation Template configuration + +The Self Node Remediation Operator also creates the `SelfNodeRemediationTemplate` Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes. The following remediation strategies are available: + +`ResourceDeletion`:: This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. `ResourceDeletion` is the default remediation strategy. + +`NodeDeletion`:: This remediation strategy is deprecated and will be removed in a future release. In the current release, the `ResourceDeletion` strategy is used even if the `NodeDeletion` strategy is selected. + +The Self Node Remediation Operator creates a `SelfNodeRemediationTemplate` CR named `self-node-remediation-resource-deletion-template`, which the `ResourceDeletion` remediation strategy uses. + +The `SelfNodeRemediationTemplate` CR resembles the following YAML file: + +[source,yaml] +---- +apiVersion: self-node-remediation.medik8s.io/v1alpha1 +kind: SelfNodeRemediationTemplate +metadata: + creationTimestamp: "2022-03-02T08:02:40Z" + name: self-node-remediation--deletion-template <1> + namespace: openshift-operators +spec: + template: + spec: + remediationStrategy: <2> +---- +<1> Specifies the type of remediation template based on the remediation strategy. 
Replace `` with either `resource` or `node`; for example, `self-node-remediation-resource-deletion-template`. +//<2> Specifies the remediation strategy. The remediation strategy can either be `ResourceDeletion` or `NodeDeletion`. +<2> Specifies the remediation strategy. The remediation strategy is `ResourceDeletion`. diff --git a/modules/eco-self-node-remediation-operator-control-plane-fencing.adoc b/modules/eco-self-node-remediation-operator-control-plane-fencing.adoc index 6595399d3a..d34ae660e3 100644 --- a/modules/eco-self-node-remediation-operator-control-plane-fencing.adoc +++ b/modules/eco-self-node-remediation-operator-control-plane-fencing.adoc @@ -19,8 +19,9 @@ Self Node Remediation occurs in two primary scenarios. ** When there is no API Server Connectivity, the control plane node will be remediated as outlined with these steps: -*** Check the status of the control plane node with the majority of the peer worker nodes. If its status is unhealthy or unknown, even if the control plane node can communicate with the peer worker nodes, the node will be analyzed further. +*** Check the status of the control plane node with the majority of the peer worker nodes. If the majority of the peer worker nodes cannot be reached, the node will be analyzed further. **** Self-diagnose the status of the control plane node ***** If self diagnostics passed, no action will be taken. ***** If self diagnostics failed, the node will be fenced and remediated. +***** The currently supported self diagnostics are checking the `kubelet` service status and checking endpoint availability by using `opt in` configuration. *** If the node did not manage to communicate to most of its worker peers, check the connectivity of the control plane node with other control plane nodes. If the node can communicate with any other control plane peer, no action will be taken. Otherwise, the node will be fenced and remediated. 
diff --git a/modules/eco-self-node-remediation-operator-installation-web-console.adoc b/modules/eco-self-node-remediation-operator-installation-web-console.adoc index 1c9655e4c4..1f44cdb638 100644 --- a/modules/eco-self-node-remediation-operator-installation-web-console.adoc +++ b/modules/eco-self-node-remediation-operator-installation-web-console.adoc @@ -8,6 +8,11 @@ You can use the {product-title} web console to install the Self Node Remediation Operator. +[NOTE] +==== +The Node Health Check Operator also installs the Self Node Remediation Operator as a default remediation provider. +==== + .Prerequisites * Log in as a user with `cluster-admin` privileges. @@ -29,4 +34,4 @@ To confirm that the installation is successful: If the Operator is not installed successfully: . Navigate to the *Operators* -> *Installed Operators* page and inspect the `Status` column for any errors or failures. -. Navigate to the *Workloads* -> *Pods* page and check the logs in any pods in the `self-node-remediation-controller-manager` project that are reporting issues. \ No newline at end of file +. Navigate to the *Workloads* -> *Pods* page and check the logs in any pods in the `self-node-remediation-controller-manager` project that are reporting issues. diff --git a/modules/machine-health-checks-about.adoc b/modules/machine-health-checks-about.adoc index aefa60ac86..d669e0ce68 100644 --- a/modules/machine-health-checks-about.adoc +++ b/modules/machine-health-checks-about.adoc @@ -7,8 +7,6 @@ [id="machine-health-checks-about_{context}"] = About machine health checks -Machine health checks automatically repair unhealthy machines in a particular machine pool. - [NOTE] ==== You can only apply a machine health check to control plane machines on clusters that use control plane machine sets. 
diff --git a/nodes/nodes/ecosystems/eco-about-remediation-fencing-maintenance.adoc b/nodes/nodes/ecosystems/eco-about-remediation-fencing-maintenance.adoc new file mode 100644 index 0000000000..962a25cacd --- /dev/null +++ b/nodes/nodes/ecosystems/eco-about-remediation-fencing-maintenance.adoc @@ -0,0 +1,39 @@ +:_content-type: ASSEMBLY +[id="about-remediation-fencing-maintenance"] += About node remediation, fencing, and maintenance +include::_attributes/common-attributes.adoc[] +:context: about-node-remediation-fencing-maintenance + +toc::[] + +Hardware is imperfect and software contains bugs. When node-level failures occur, such as a kernel hang or a network interface controller (NIC) failure, the work required from the cluster does not decrease, and workloads from affected nodes need to be restarted somewhere. However, some workloads, such as ReadWriteOnce (RWO) volumes and StatefulSets, might require at-most-one semantics. + +Failures affecting these workloads risk data loss, corruption, or both. It is important to ensure that the node reaches a safe state, known as `fencing`, before initiating recovery of the workload, known as `remediation`, and ideally, recovery of the node also. + +It is not always practical to depend on administrator intervention to confirm the true status of the nodes and workloads. To facilitate such intervention, {product-title} provides multiple components for the automation of failure detection, fencing, and remediation. + +[id="about-remediation-fencing-maintenance-snr"] +== Self Node Remediation + +The Self Node Remediation Operator is an {product-title} add-on operator that implements an external system of fencing and remediation that reboots unhealthy nodes and deletes resources, such as Pods and VolumeAttachments. The reboot ensures that the workloads are fenced, and the resource deletion accelerates the rescheduling of affected workloads. 
Unlike other external systems, Self Node Remediation does not require a management interface, such as Intelligent Platform Management Interface (IPMI) or an API for node provisioning. + +Self Node Remediation can be used by failure detection systems, such as Machine Health Check or Node Health Check. + +[id="about-remediation-fencing-maintenance-mhc"] +== Machine Health Check + +Machine Health Check utilizes an {product-title} built-in failure detection, fencing, and remediation system, which monitors the status of machines and the conditions of nodes. Machine Health Checks can be configured to trigger external fencing and remediation systems, such as Self Node Remediation. + +[id="about-remediation-fencing-maintenance-nhc"] +== Node Health Check + +The Node Health Check Operator is an {product-title} add-on operator that implements a failure detection system that monitors node conditions. It does not have a built-in fencing or remediation system and so must be configured with an external system that provides such features. By default, it is configured to utilize the Self Node Remediation system. + +[id="about-remediation-fencing-maintenance-node"] +== Node Maintenance + +Administrators face situations where they need to interrupt the cluster, for example, to replace a drive, RAM, or a NIC. + +In advance of this maintenance, affected nodes should be cordoned and drained. When a node is cordoned, new workloads cannot be scheduled on that node. When a node is drained, to avoid or minimize downtime, workloads on the affected node are transferred to other nodes. + +Although this maintenance can be achieved by using command-line tools, the Node Maintenance Operator offers a declarative approach to achieve this by using a custom resource. When such a resource exists for a node, the operator cordons and drains the node until the resource is deleted. 
diff --git a/nodes/nodes/ecosystems/eco-machine-health-checks.adoc b/nodes/nodes/ecosystems/eco-machine-health-checks.adoc new file mode 100644 index 0000000000..8e0034cd27 --- /dev/null +++ b/nodes/nodes/ecosystems/eco-machine-health-checks.adoc @@ -0,0 +1,13 @@ +:_content-type: ASSEMBLY +[id="machine-health-checks"] += Remediating nodes with Machine Health Checks +include::_attributes/common-attributes.adoc[] +:context: machine-health-checks + +toc::[] + +Machine health checks automatically repair unhealthy machines in a particular machine pool. + +include::modules/machine-health-checks-about.adoc[leveloffset=+1] + +include::modules/eco-configuring-machine-health-check-with-self-node-remediation.adoc[leveloffset=+1] diff --git a/nodes/nodes/eco-node-health-check-operator.adoc b/nodes/nodes/ecosystems/eco-node-health-check-operator.adoc similarity index 51% rename from nodes/nodes/eco-node-health-check-operator.adoc rename to nodes/nodes/ecosystems/eco-node-health-check-operator.adoc index d0975c5cd7..fef364decc 100644 --- a/nodes/nodes/eco-node-health-check-operator.adoc +++ b/nodes/nodes/ecosystems/eco-node-health-check-operator.adoc @@ -1,17 +1,17 @@ :_content-type: ASSEMBLY [id="node-health-check-operator"] -= Deploying node health checks by using the Node Health Check Operator += Remediating nodes with Node Health Checks include::_attributes/common-attributes.adoc[] :context: node-health-check-operator toc::[] -Use the Node Health Check Operator to identify unhealthy nodes. The Operator uses the Self Node Remediation Operator to remediate the unhealthy nodes. +You can use the Node Health Check Operator to identify unhealthy nodes. The Operator uses the Self Node Remediation Operator to remediate the unhealthy nodes. 
[role="_additional-resources"] .Additional resources -xref:../../nodes/nodes/eco-self-node-remediation-operator.adoc#self-node-remediation-operator-remediate-nodes[Remediating nodes with the Self Node Remediation Operator] +xref:../../../nodes/nodes/ecosystems/eco-self-node-remediation-operator.adoc#self-node-remediation-operator-remediate-nodes[Remediating nodes with the Self Node Remediation Operator] include::modules/eco-node-health-check-operator-about.adoc[leveloffset=+1] @@ -25,9 +25,9 @@ include::modules/eco-node-health-check-operator-creating-node-health-check.adoc[ [id="gather-data-nhc"] == Gathering data about the Node Health Check Operator -To collect debugging information about the Node Health Check Operator, use the `must-gather` tool. For information about the `must-gather` image for the Node Health Check Operator, see xref:../../support/gathering-cluster-data.adoc#gathering-data-specific-features_gathering-cluster-data[Gathering data about specific features]. +To collect debugging information about the Node Health Check Operator, use the `must-gather` tool. For information about the `must-gather` image for the Node Health Check Operator, see xref:../../../support/gathering-cluster-data.adoc#gathering-data-specific-features_gathering-cluster-data[Gathering data about specific features]. [id="additional-resources-nhc-operator-installation"] == Additional resources -* xref:../../operators/admin/olm-upgrading-operators.adoc#olm-changing-update-channel_olm-upgrading-operators[Changing the update channel for an Operator] -* The Node Health Check Operator is supported in a restricted network environment. For more information, see xref:../../operators/admin/olm-restricted-networks.adoc#olm-restricted-networks[Using Operator Lifecycle Manager on restricted networks]. 
+* xref:../../../operators/admin/olm-upgrading-operators.adoc#olm-changing-update-channel_olm-upgrading-operators[Changing the update channel for an Operator] +* xref:../../../operators/admin/olm-restricted-networks.adoc#olm-restricted-networks[Using Operator Lifecycle Manager on restricted networks]. diff --git a/nodes/nodes/eco-node-maintenance-operator.adoc b/nodes/nodes/ecosystems/eco-node-maintenance-operator.adoc similarity index 78% rename from nodes/nodes/eco-node-maintenance-operator.adoc rename to nodes/nodes/ecosystems/eco-node-maintenance-operator.adoc index 39393e4cd2..f02c043bc0 100644 --- a/nodes/nodes/eco-node-maintenance-operator.adoc +++ b/nodes/nodes/ecosystems/eco-node-maintenance-operator.adoc @@ -1,6 +1,6 @@ :_content-type: ASSEMBLY [id="node-maintenance-operator"] -= Using the Node Maintenance Operator to place nodes in maintenance mode += Placing nodes in maintenance mode with Node Maintenance Operator include::_attributes/common-attributes.adoc[] :context: node-maintenance-operator @@ -23,7 +23,7 @@ include::modules/eco-node-maintenance-operator-installation-web-console.adoc[lev include::modules/eco-node-maintenance-operator-installation-cli.adoc[leveloffset=+2] -The Node Maintenance Operator is supported in a restricted network environment. For more information, see xref:../../operators/admin/olm-restricted-networks.adoc#olm-restricted-networks[Using Operator Lifecycle Manager on restricted networks]. +The Node Maintenance Operator is supported in a restricted network environment. For more information, see xref:../../../operators/admin/olm-restricted-networks.adoc#olm-restricted-networks[Using Operator Lifecycle Manager on restricted networks]. 
[id="setting-node-in-maintenance-mode"] == Setting a node to maintenance mode @@ -60,11 +60,11 @@ include::modules/eco-resuming-node-maintenance-actions-web-console.adoc[leveloff [id="gather-data-nmo"] == Gathering data about the Node Maintenance Operator -To collect debugging information about the Node Maintenance Operator, use the `must-gather` tool. For information about the `must-gather` image for the Node Maintenance Operator, see xref:../../support/gathering-cluster-data.adoc#gathering-data-specific-features_gathering-cluster-data[Gathering data about specific features]. +To collect debugging information about the Node Maintenance Operator, use the `must-gather` tool. For information about the `must-gather` image for the Node Maintenance Operator, see xref:../../../support/gathering-cluster-data.adoc#gathering-data-specific-features_gathering-cluster-data[Gathering data about specific features]. [role="_additional-resources"] [id="additional-resources-node-maintenance-operator-installation"] == Additional resources -* xref:../../support/gathering-cluster-data.adoc#gathering-cluster-data[Gathering data about your cluster] -* xref:../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-evacuating_nodes-nodes-working[Understanding how to evacuate pods on nodes] -* xref:../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-marking_nodes-nodes-working[Understanding how to mark nodes as unschedulable or schedulable] +* xref:../../../support/gathering-cluster-data.adoc#gathering-cluster-data[Gathering data about your cluster] +* xref:../../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-evacuating_nodes-nodes-working[Understanding how to evacuate pods on nodes] +* xref:../../../nodes/nodes/nodes-nodes-working.adoc#nodes-nodes-working-marking_nodes-nodes-working[Understanding how to mark nodes as unschedulable or schedulable] diff --git a/nodes/nodes/eco-self-node-remediation-operator.adoc 
b/nodes/nodes/ecosystems/eco-self-node-remediation-operator.adoc similarity index 62% rename from nodes/nodes/eco-self-node-remediation-operator.adoc rename to nodes/nodes/ecosystems/eco-self-node-remediation-operator.adoc index 171704b09b..2e21ddc6d4 100644 --- a/nodes/nodes/eco-self-node-remediation-operator.adoc +++ b/nodes/nodes/ecosystems/eco-self-node-remediation-operator.adoc @@ -1,6 +1,6 @@ :_content-type: ASSEMBLY [id="self-node-remediation-operator-remediate-nodes"] -= Remediating nodes with the Self Node Remediation Operator += Using Self Node Remediation include::_attributes/common-attributes.adoc[] :context: self-node-remediation-operator-remediate-nodes @@ -12,8 +12,10 @@ include::modules/eco-self-node-remediation-operator-about.adoc[leveloffset=+1] include::modules/eco-self-node-remediation-about-watchdog.adoc[leveloffset=+2] +[role="_additional-resources"] .Additional resources -xref:../../virt/virtual_machines/advanced_vm_management/virt-configuring-a-watchdog.adoc#virt-configuring-a-watchdog[Configuring a watchdog] + +xref:../../../virt/virtual_machines/advanced_vm_management/virt-configuring-a-watchdog.adoc#virt-configuring-a-watchdog[Configuring a watchdog] include::modules/eco-self-node-remediation-operator-control-plane-fencing.adoc[leveloffset=+1] @@ -21,17 +23,15 @@ include::modules/eco-self-node-remediation-operator-installation-web-console.ado include::modules/eco-self-node-remediation-operator-installation-cli.adoc[leveloffset=+1] -include::modules/eco-configuring-machine-health-check-with-self-node-remediation.adoc[leveloffset=+1] +include::modules/eco-self-node-remediation-operator-configuring.adoc[leveloffset=+1] include::modules/eco-self-node-remediation-operator-troubleshooting.adoc[leveloffset=+1] [id="gather-data-self-node-remediation"] == Gathering data about the Self Node Remediation Operator -To collect debugging information about the Self Node Remediation Operator, use the `must-gather` tool. 
For information about the `must-gather` image for the Self Node Remediation Operator, see xref:../../support/gathering-cluster-data.adoc#gathering-data-specific-features_gathering-cluster-data[Gathering data about specific features]. +To collect debugging information about the Self Node Remediation Operator, use the `must-gather` tool. For information about the `must-gather` image for the Self Node Remediation Operator, see xref:../../../support/gathering-cluster-data.adoc#gathering-data-specific-features_gathering-cluster-data[Gathering data about specific features]. -[role="_additional-resources"] [id="additional-resources-self-node-remediation-operator-installation"] == Additional resources - -* The Self Node Remediation Operator is supported in a restricted network environment. For more information, see xref:../../operators/admin/olm-restricted-networks.adoc#olm-restricted-networks[Using Operator Lifecycle Manager on restricted networks]. -* xref:../../operators/admin/olm-deleting-operators-from-cluster.adoc#olm-deleting-operators-from-a-cluster[Deleting Operators from a cluster] +* xref:../../../operators/admin/olm-restricted-networks.adoc#olm-restricted-networks[Using Operator Lifecycle Manager on restricted networks]. 
+* xref:../../../operators/admin/olm-deleting-operators-from-cluster.adoc#olm-deleting-operators-from-a-cluster[Deleting Operators from a cluster]
diff --git a/nodes/nodes/ecosystems/images b/nodes/nodes/ecosystems/images
new file mode 120000
index 0000000000..5fa6987088
--- /dev/null
+++ b/nodes/nodes/ecosystems/images
@@ -0,0 +1 @@
+../../images
\ No newline at end of file
diff --git a/release_notes/ocp-4-12-release-notes.adoc b/release_notes/ocp-4-12-release-notes.adoc
index 979d24a382..7634e7b7ed 100644
--- a/release_notes/ocp-4-12-release-notes.adoc
+++ b/release_notes/ocp-4-12-release-notes.adoc
@@ -1026,16 +1026,16 @@ For more information, see xref:../nodes/jobs/nodes-nodes-jobs.adoc#nodes-nodes-j
 [id="ocp-4-12-self-node-remediation-operator"]
 ==== Self Node Remediation Operator enhancements
-{product-title} now supports control plane fencing by the Self Node Remediation Operator. In the event of node failure, you can follow remediation strategies on both worker nodes and control plane nodes. For more information, see xref:../nodes/nodes/eco-self-node-remediation-operator.adoc#control-plane-fencing-self-node-remediation-operator_self-node-remediation-operator-remediate-nodes[Control Plane Fencing].
+{product-title} now supports control plane fencing by the Self Node Remediation Operator. In the event of node failure, you can follow remediation strategies on both worker nodes and control plane nodes. For more information, see xref:../nodes/nodes/ecosystems/eco-self-node-remediation-operator.adoc#control-plane-fencing-self-node-remediation-operator_self-node-remediation-operator-remediate-nodes[Control Plane Fencing].
 [id="ocp-4-12-node-health-check-operator"]
 ==== Node Health Check Operator enhancements
-{product-title} now supports control plane fencing on the Node Health Check Operator. In the event of node failure, you can follow remediation strategies on both worker nodes and control plane nodes. For more information, see xref:../nodes/nodes/eco-node-health-check-operator.adoc#control-plane-fencing-node-health-check-operator_node-health-check-operator[Control Plane Fencing].
+{product-title} now supports control plane fencing on the Node Health Check Operator. In the event of node failure, you can follow remediation strategies on both worker nodes and control plane nodes. For more information, see xref:../nodes/nodes/ecosystems/eco-node-health-check-operator.adoc#control-plane-fencing-node-health-check-operator_node-health-check-operator[Control Plane Fencing].
-The Node Health Check Operator now also includes a web console plug-in for managing Node Health Checks. For more information, see xref:../nodes/nodes/eco-node-health-check-operator.adoc#eco-node-health-check-operator-creating-node-health-check_node-health-check-operator[Creating a node health check].
+The Node Health Check Operator now also includes a web console plug-in for managing Node Health Checks. For more information, see xref:../nodes/nodes/ecosystems/eco-node-health-check-operator.adoc#eco-node-health-check-operator-creating-node-health-check_node-health-check-operator[Creating a node health check].
-For installing or updating to the latest version of the Node Health Check Operator, use the `stable` subscription channel. For more information, see xref:../nodes/nodes/eco-node-health-check-operator.adoc#installing-node-health-check-operator-using-cli_node-health-check-operator[Installing the Node Health Check Operator by using the CLI].
+For installing or updating to the latest version of the Node Health Check Operator, use the `stable` subscription channel. For more information, see xref:../nodes/nodes/ecosystems/eco-node-health-check-operator.adoc#installing-node-health-check-operator-using-cli_node-health-check-operator[Installing the Node Health Check Operator by using the CLI].
 [id="ocp-4-11-logging"]
 === Logging
diff --git a/virt/install/preparing-cluster-for-virt.adoc b/virt/install/preparing-cluster-for-virt.adoc
index 65213f36f2..d781440173 100644
--- a/virt/install/preparing-cluster-for-virt.adoc
+++ b/virt/install/preparing-cluster-for-virt.adoc
@@ -72,7 +72,7 @@ You can configure one of the following high-availability (HA) options for your c
 In {product-title} clusters installed using installer-provisioned infrastructure and with MachineHealthCheck properly configured, if a node fails the MachineHealthCheck and becomes unavailable to the cluster, it is recycled. What happens next with VMs that ran on the failed node depends on a series of conditions. See xref:../../virt/virtual_machines/virt-create-vms.adoc#virt-about-runstrategies-vms_virt-create-vms[About RunStrategies for virtual machines] for more detailed information about the potential outcomes and how RunStrategies affect those outcomes.
 ====
-* Automatic high availability for both IPI and non-IPI is available by using the xref:../../nodes/nodes/eco-node-health-check-operator.adoc#node-health-check-operator[Node Health Check Operator] on the {product-title} cluster to deploy the `NodeHealthCheck` controller. The controller identifies unhealthy nodes and uses the Self Node Remediation Operator to remediate the unhealthy nodes.
+* Automatic high availability for both IPI and non-IPI is available by using the xref:../../nodes/nodes/ecosystems/eco-node-health-check-operator.adoc#node-health-check-operator[Node Health Check Operator] on the {product-title} cluster to deploy the `NodeHealthCheck` controller. The controller identifies unhealthy nodes and uses the Self Node Remediation Operator to remediate the unhealthy nodes.
 +
 --
 ifdef::openshift-enterprise[]
diff --git a/virt/node_maintenance/virt-about-node-maintenance.adoc b/virt/node_maintenance/virt-about-node-maintenance.adoc
index 2ffdcdc347..a9826495b4 100644
--- a/virt/node_maintenance/virt-about-node-maintenance.adoc
+++ b/virt/node_maintenance/virt-about-node-maintenance.adoc
@@ -13,9 +13,9 @@ include::modules/virt-maintaining-bare-metal-nodes.adoc[leveloffset=+1]
 [role="_additional-resources"]
 [id="additional-resources_virt-about-node-maintenance"]
 == Additional resources
-* xref:../../nodes/nodes/eco-node-maintenance-operator.adoc#installing-maintenance-operator-using-cli_node-maintenance-operator[Installing the Node Maintenance Operator by using the CLI]
-* xref:../../nodes/nodes/eco-node-maintenance-operator.adoc#setting-node-in-maintenance-mode[Setting a node to maintenance mode]
-* xref:../../nodes/nodes/eco-node-maintenance-operator.adoc#resuming-node-from-maintenance-mode[Resuming a node from maintenance mode]
+* xref:../../nodes/nodes/ecosystems/eco-node-maintenance-operator.adoc#installing-maintenance-operator-using-cli_node-maintenance-operator[Installing the Node Maintenance Operator by using the CLI]
+* xref:../../nodes/nodes/ecosystems/eco-node-maintenance-operator.adoc#setting-node-in-maintenance-mode[Setting a node to maintenance mode]
+* xref:../../nodes/nodes/ecosystems/eco-node-maintenance-operator.adoc#resuming-node-from-maintenance-mode[Resuming a node from maintenance mode]
 * xref:../../virt/virtual_machines/virt-create-vms.adoc#virt-about-runstrategies-vms_virt-create-vms[About RunStrategies for virtual machines]
 * xref:../../virt/live_migration/virt-live-migration.adoc#virt-live-migration[Virtual machine live migration]
 * xref:../../virt/live_migration/virt-configuring-vmi-eviction-strategy.adoc#virt-configuring-vmi-eviction-strategy[Configuring virtual machine eviction strategy]
diff --git a/virt/upgrading-virt.adoc b/virt/upgrading-virt.adoc
index 66fad63a22..64d286be55 100644
--- a/virt/upgrading-virt.adoc
+++ b/virt/upgrading-virt.adoc
@@ -10,7 +10,7 @@ Learn how Operator Lifecycle Manager (OLM) delivers z-stream and minor version u
 [NOTE]
 ====
-* The Node Maintenance Operator (NMO) is no longer shipped with {VirtProductName}. You can xref:../nodes/nodes/eco-node-maintenance-operator.adoc#node-maintenance-operator[install the NMO] from the *OperatorHub* in the {product-title} web console, or by using the OpenShift CLI (`oc`).
+* The Node Maintenance Operator (NMO) is no longer shipped with {VirtProductName}. You can xref:../nodes/nodes/ecosystems/eco-node-maintenance-operator.adoc#node-maintenance-operator[install the NMO] from the *OperatorHub* in the {product-title} web console, or by using the OpenShift CLI (`oc`).
 +
 You must perform one of the following tasks before updating to {VirtProductName} 4.11 from {VirtProductName} 4.10.2 and later releases: