mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-05 12:46:18 +01:00

Machine deletion hooks

Jeana Routh
2023-01-20 16:41:05 -05:00
committed by openshift-cherrypick-robot
parent 7eb09f2175
commit ca1984d81f
15 changed files with 302 additions and 26 deletions

View File

@@ -39,6 +39,9 @@ Depending on the state of your unhealthy etcd member, use one of the following p
// Replacing an unhealthy etcd member whose machine is not running or whose node is not ready
include::modules/restore-replace-stopped-etcd-member.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* xref:../../machine_management/control_plane_machine_management/cpmso-troubleshooting.adoc#cpmso-ts-etcd-degraded_cpmso-troubleshooting[Recovering a degraded etcd Operator]
// Replacing an unhealthy etcd member whose etcd pod is crashlooping
include::modules/restore-replace-crashlooping-etcd-member.adoc[leveloffset=+2]
@@ -46,4 +49,7 @@ include::modules/restore-replace-crashlooping-etcd-member.adoc[leveloffset=+2]
// Replacing an unhealthy baremetal stopped etcd member
include::modules/restore-replace-stopped-baremetal-etcd-member.adoc[leveloffset=+2]
[role="_additional-resources"]
[id="additional-resources_replacing-unhealthy-etcd-member"]
== Additional resources
* xref:../../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-etcd_deleting-machine[Quorum protection with machine lifecycle hooks]

View File

@@ -29,6 +29,10 @@ You might run into several situations where {product-title} does not work as ex
You can always recover from a disaster situation by xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restoring your cluster to its previous state] using the saved etcd snapshots.
[role="_additional-resources"]
.Additional resources
* xref:../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-etcd_deleting-machine[Quorum protection with machine lifecycle hooks]
[id="application-backup-restore-operations-overview"]
== Application backup and restore operations

Binary file not shown (size: 216 KiB).

View File

@@ -15,10 +15,8 @@ When possible, the control plane machine set spreads the control plane machines
//Failure domain platform support and configuration
include::modules/cpmso-failure-domains-provider.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* xref:../../machine_management/control_plane_machine_management/cpmso-configuration.adoc#cpmso-yaml-failure-domain-aws_cpmso-configuration[Sample Amazon Web Services failure domain configuration]
* xref:../../machine_management/control_plane_machine_management/cpmso-configuration.adoc#cpmso-yaml-failure-domain-gcp_cpmso-configuration[Sample Google Cloud Platform failure domain configuration]
@@ -30,8 +28,12 @@ include::modules/cpmso-failure-domains-balancing.adoc[leveloffset=+2]
//Recovery of the failed control plane machines
include::modules/cpmso-control-plane-recovery.adoc[leveloffset=+1]
[role="_additional-resources"]
.Additional resources
* xref:../../machine_management/deploying-machine-health-checks.adoc#deploying-machine-health-checks[Deploying machine health checks]
//Quorum protection with machine lifecycle hooks
include::modules/machine-lifecycle-hook-deletion-etcd.adoc[leveloffset=+1]
[role="_additional-resources"]
.Additional resources
* xref:../../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion_deleting-machine[Lifecycle hooks for the machine deletion phase]

View File

@@ -8,8 +8,22 @@ toc::[]
You can delete a specific machine.
//Deleting a specific machine
include::modules/machine-delete.adoc[leveloffset=+1]
//Lifecycle hooks for the machine deletion phase
include::modules/machine-lifecycle-hook-deletion.adoc[leveloffset=+1]
//Deletion lifecycle hook configuration
include::modules/machine-lifecycle-hook-deletion-format.adoc[leveloffset=+2]
//Machine deletion lifecycle hook examples for Operator developers
include::modules/machine-lifecycle-hook-deletion-uses.adoc[leveloffset=+2]
//Quorum protection with machine lifecycle hooks
include::modules/machine-lifecycle-hook-deletion-etcd.adoc[leveloffset=+2]
[role="_additional-resources"]
[id="additional-resources_unhealthy-etcd-member"]
== Additional resources

View File

@@ -22,3 +22,8 @@ include::modules/machine-user-provisioned-limitations.adoc[leveloffset=+1]
include::modules/machineset-manually-scaling.adoc[leveloffset=+1]
include::modules/machineset-delete-policy.adoc[leveloffset=+1]
[role="_additional-resources"]
[id="additional-resources_manually-scaling-machineset"]
== Additional resources
* xref:../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion_deleting-machine[Lifecycle hooks for the machine deletion phase]

View File

@@ -17,6 +17,10 @@ If you need to scale a compute machine set without making other changes, see xre
include::modules/machineset-modifying.adoc[leveloffset=+1]
[role="_additional-resources"]
.Additional resources
* xref:../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion_deleting-machine[Lifecycle hooks for the machine deletion phase]
[id="migrating-nodes-to-a-different-storage-domain-rhv_{context}"]
== Migrating nodes to a different storage domain on {rh-virtualization}

View File

@@ -7,12 +7,12 @@
[id="machine-delete_{context}"]
= Deleting a specific machine
You can delete a specific machine.
[IMPORTANT]
====
Do not delete a control plane machine unless your cluster uses a control plane machine set.
====
.Prerequisites
@@ -22,24 +22,29 @@ Do not delete a control plane machine unless your cluster uses a control plane m
.Procedure
. View the machines that are in the cluster and identify the one to delete:
. View the machines that are in the cluster by running the following command:
+
[source,terminal]
----
$ oc get machine -n openshift-machine-api
----
+
The command output contains a list of machines in the `<clusterid>-worker-<cloud_region>` format.
The command output contains a list of machines in the `<clusterid>-<role>-<cloud_region>` format.
. Delete the machine:
. Identify the machine that you want to delete.
. Delete the machine by running the following command:
+
[source,terminal]
----
$ oc delete machine <machine> -n openshift-machine-api
----
+
[IMPORTANT]
====
By default, the machine controller tries to drain the node that is backed by the machine until it succeeds. In some situations, such as with a misconfigured pod disruption budget, the drain operation might not be able to succeed in preventing the machine from being deleted. You can skip draining the node by annotating "machine.openshift.io/exclude-node-draining" in a specific machine. If the machine being deleted belongs to a compute machine set, a new machine is immediately created to satisfy the specified number of replicas.
By default, the machine controller tries to drain the node that is backed by the machine until it succeeds. In some situations, such as with a misconfigured pod disruption budget, the drain operation might not be able to succeed. If the drain operation fails, the machine controller cannot proceed with removing the machine.
You can skip draining the node by adding the `machine.openshift.io/exclude-node-draining` annotation to a specific machine.
====
+
If the machine that you delete belongs to a machine set, a new machine is immediately created to satisfy the specified number of replicas.

View File

@@ -0,0 +1,55 @@
// Module included in the following assemblies:
//
// * machine_management/deleting-machine.adoc
:_content-type: CONCEPT
[id="machine-lifecycle-hook-deletion-etcd_{context}"]
= Quorum protection with machine lifecycle hooks
For {product-title} clusters that use the Machine API Operator, the etcd Operator uses lifecycle hooks for the machine deletion phase to implement a quorum protection mechanism.
By using a `preDrain` lifecycle hook, the etcd Operator can control when the pods on a control plane machine are drained and removed. To protect etcd quorum, the etcd Operator prevents the removal of an etcd member until it migrates that member onto a new node within the cluster.
This mechanism allows the etcd Operator precise control over the members of the etcd quorum and allows the Machine API Operator to safely create and remove control plane machines without specific operational knowledge of the etcd cluster.
[id="machine-lifecycle-hook-deletion-etcd-order_{context}"]
== Control plane deletion with quorum protection processing order
When a control plane machine is replaced on a cluster that uses a control plane machine set, the cluster temporarily has four control plane machines. When the fourth control plane node joins the cluster, the etcd Operator starts a new etcd member on the replacement node. When the etcd Operator observes that the old control plane machine is marked for deletion, it stops the etcd member on the old node and promotes the replacement etcd member to join the quorum of the cluster.
The control plane machine `Deleting` phase proceeds in the following order:
. A control plane machine is slated for deletion.
. The control plane machine enters the `Deleting` phase.
. To satisfy the `preDrain` lifecycle hook, the etcd Operator takes the following actions:
+
--
.. The etcd Operator waits until a fourth control plane machine is added to the cluster as an etcd member. This new etcd member has a state of `Running` but not `ready` until it receives the full database update from the etcd leader.
.. When the new etcd member receives the full database update, the etcd Operator promotes the new etcd member to a voting member and removes the old etcd member from the cluster.
--
After this transition is complete, it is safe for the old etcd pod and its data to be removed, so the `preDrain` lifecycle hook is removed.
. The control plane machine status condition `Drainable` is set to `True`.
. The machine controller attempts to drain the node that is backed by the control plane machine.
** If draining fails, `Drained` is set to `False` and the machine controller attempts to drain the node again.
** If draining succeeds, `Drained` is set to `True`.
. The control plane machine status condition `Drained` is set to `True`.
. If no other Operators have added a `preTerminate` lifecycle hook, the control plane machine status condition `Terminable` is set to `True`.
. The machine controller removes the instance from the infrastructure provider.
. The machine controller deletes the `Node` object.
.YAML snippet demonstrating the etcd quorum protection `preDrain` lifecycle hook
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: ControlPlaneMachineSet
metadata:
  ...
spec:
  lifecycleHooks:
    preDrain:
    - name: EtcdQuorumOperator <1>
      owner: clusteroperator/etcd <2>
  ...
----
<1> The name of the `preDrain` lifecycle hook.
<2> The hook-implementing controller that manages the `preDrain` lifecycle hook.
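The member swap described in this module can be sketched as a toy model. This is an illustration only, not the etcd Operator's implementation: the function names, the member names, and the `new_is_synced` flag are hypothetical, and the only outside fact assumed is that etcd quorum is a simple majority of voting members.

```python
def quorum(voting_members: int) -> int:
    """Return the majority needed among the voting members."""
    return voting_members // 2 + 1


def replace_member(voters: set, old: str, new: str, new_is_synced: bool) -> set:
    """Return the voter set after attempting to swap `old` for `new`.

    The swap happens only after the new member has received the full
    database update; until then the voter set is left unchanged, which
    is the situation in which the preDrain hook stays in place.
    """
    if not new_is_synced:
        return set(voters)
    return (voters - {old}) | {new}


# Hypothetical member names for a three-member control plane.
voters = {"master-0", "master-1", "master-2"}
still_waiting = replace_member(voters, "master-0", "master-3", new_is_synced=False)
after_swap = replace_member(voters, "master-0", "master-3", new_is_synced=True)
print(quorum(len(voters)), sorted(still_waiting), sorted(after_swap))
```

In this sketch the number of voting members, and therefore the quorum size, is the same before and after the swap, which is the property that the `preDrain` hook is protecting.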

View File

@@ -0,0 +1,74 @@
// Module included in the following assemblies:
//
// * machine_management/deleting-machine.adoc
:_content-type: REFERENCE
[id="machine-lifecycle-hook-deletion-format_{context}"]
= Deletion lifecycle hook configuration
The following YAML snippets demonstrate the format and placement of deletion lifecycle hook configurations within a machine set:
.YAML snippet demonstrating a `preDrain` lifecycle hook
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  ...
spec:
  lifecycleHooks:
    preDrain:
    - name: <hook-name> <1>
      owner: <hook-owner> <2>
  ...
----
<1> The name of the `preDrain` lifecycle hook.
<2> The hook-implementing controller that manages the `preDrain` lifecycle hook.
.YAML snippet demonstrating a `preTerminate` lifecycle hook
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  ...
spec:
  lifecycleHooks:
    preTerminate:
    - name: <hook-name> <1>
      owner: <hook-owner> <2>
  ...
----
<1> The name of the `preTerminate` lifecycle hook.
<2> The hook-implementing controller that manages the `preTerminate` lifecycle hook.
[discrete]
[id="machine-lifecycle-hook-deletion-example_{context}"]
== Example lifecycle hook configuration
The following example demonstrates the implementation of multiple fictional lifecycle hooks that interrupt the machine deletion process:
.Example configuration for lifecycle hooks
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  ...
spec:
  lifecycleHooks:
    preDrain: <1>
    - name: MigrateImportantApp
      owner: my-app-migration-controller
    preTerminate: <2>
    - name: BackupFileSystem
      owner: my-backup-controller
    - name: CloudProviderSpecialCase
      owner: my-custom-storage-detach-controller <3>
    - name: WaitForStorageDetach
      owner: my-custom-storage-detach-controller
  ...
----
<1> A `preDrain` lifecycle hook stanza that contains a single lifecycle hook.
<2> A `preTerminate` lifecycle hook stanza that contains three lifecycle hooks.
<3> A hook-implementing controller that manages two `preTerminate` lifecycle hooks: `CloudProviderSpecialCase` and `WaitForStorageDetach`.
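To make the relationship between hooks and owners in the example configuration easier to see, the following sketch rewrites the `lifecycleHooks` stanza as plain Python data and groups the hooks by owner. The helper name `hooks_by_owner` is hypothetical and the data is copied from the fictional example; nothing here calls a real cluster API.

```python
from collections import defaultdict

# The fictional lifecycleHooks stanza from the example, as plain data.
lifecycle_hooks = {
    "preDrain": [
        {"name": "MigrateImportantApp", "owner": "my-app-migration-controller"},
    ],
    "preTerminate": [
        {"name": "BackupFileSystem", "owner": "my-backup-controller"},
        {"name": "CloudProviderSpecialCase", "owner": "my-custom-storage-detach-controller"},
        {"name": "WaitForStorageDetach", "owner": "my-custom-storage-detach-controller"},
    ],
}


def hooks_by_owner(hooks: dict) -> dict:
    """Map each owner to the (lifecycle point, hook name) pairs it manages."""
    grouped = defaultdict(list)
    for point, entries in hooks.items():
        for entry in entries:
            grouped[entry["owner"]].append((point, entry["name"]))
    return dict(grouped)


for owner, managed in hooks_by_owner(lifecycle_hooks).items():
    print(owner, managed)
```

Each hook appears under exactly one owner, while `my-custom-storage-detach-controller` appears with two hooks, matching the third callout.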

View File

@@ -0,0 +1,29 @@
// Module included in the following assemblies:
//
// * machine_management/deleting-machine.adoc
:_content-type: CONCEPT
[id="machine-lifecycle-hook-deletion-uses_{context}"]
= Machine deletion lifecycle hook examples for Operator developers
Operators can use lifecycle hooks for the machine deletion phase to modify the machine deletion process. The following examples demonstrate possible ways that an Operator can use this functionality:
[discrete]
[id="machine-lifecycle-hook-deletion-uses-predrain_{context}"]
== Example use cases for `preDrain` lifecycle hooks
Proactively replacing machines:: An Operator can use a `preDrain` lifecycle hook to ensure that a replacement machine is successfully created and joined to the cluster before removing the instance of a deleted machine. This can mitigate the impact of disruptions during machine replacement or of replacement instances that do not initialize promptly.
Implementing custom draining logic:: An Operator can use a `preDrain` lifecycle hook to replace the machine controller draining logic with a different draining controller. By replacing the draining logic, the Operator would have more flexibility and control over the lifecycle of the workloads on each node.
+
For example, the machine controller drain libraries do not support ordering, but a custom drain provider could provide this functionality. By using a custom drain provider, an Operator could prioritize moving mission-critical applications before draining the node to ensure that service interruptions are minimized in cases where cluster capacity is limited.
[discrete]
[id="machine-lifecycle-hook-deletion-uses-preterminate_{context}"]
== Example use cases for `preTerminate` lifecycle hooks
Verifying storage detachment:: An Operator can use a `preTerminate` lifecycle hook to ensure that storage that is attached to a machine is detached before the machine is removed from the infrastructure provider.
Improving log reliability:: After a node is drained, the log exporter daemon requires some time to synchronize logs to the centralized logging system.
+
A logging Operator can use a `preTerminate` lifecycle hook to add a delay between when the node drains and when the machine is removed from the infrastructure provider. This delay would provide time for the Operator to ensure that the main workloads are removed and no longer adding to the log backlog. When no new data is being added to the log backlog, the log exporter can catch up on the synchronization process, thus ensuring that all application logs are captured.

View File

@@ -0,0 +1,79 @@
// Module included in the following assemblies:
//
// * machine_management/deleting-machine.adoc
// Others TBD.
//Placement considerations: Is this general info? Does it go with deletion docs? CPMS docs? etcd docs? Possibly some combo of those, or perhaps etcd as an example of a use case?
:_content-type: CONCEPT
[id="machine-lifecycle-hook-deletion_{context}"]
= Lifecycle hooks for the machine deletion phase
Machine lifecycle hooks are points in the reconciliation lifecycle of a machine where the normal lifecycle process can be interrupted. In the machine `Deleting` phase, these interruptions provide the opportunity for components to modify the machine deletion process.
[id="machine-lifecycle-hook-deletion-terms_{context}"]
== Terminology and definitions
To understand the behavior of lifecycle hooks for the machine deletion phase, you must understand the following concepts:
Reconciliation:: Reconciliation is the process by which a controller attempts to make the real state of the cluster and the objects that it comprises match the requirements in an object specification.
Machine controller:: The machine controller manages the reconciliation lifecycle for a machine. For machines on cloud platforms, the machine controller is the combination of an {product-title} controller and a platform-specific actuator from the cloud provider.
+
In the context of machine deletion, the machine controller performs the following actions:
--
* Drain the node that is backed by the machine.
* Delete the machine instance from the cloud provider.
* Delete the `Node` object.
--
Lifecycle hook:: A defined point in the reconciliation lifecycle of an object where the normal lifecycle process can be interrupted. Components can use a lifecycle hook to inject changes into the process to accomplish a desired outcome.
+
There are two lifecycle hooks in the machine `Deleting` phase:
--
* `preDrain` lifecycle hooks must be resolved before the node that is backed by the machine can be drained.
* `preTerminate` lifecycle hooks must be resolved before the instance can be removed from the infrastructure provider.
--
Hook-implementing controller:: A controller, other than the machine controller, that can interact with a lifecycle hook. A hook-implementing controller can do one or more of the following actions:
+
--
* Add a lifecycle hook.
* Respond to a lifecycle hook.
* Remove a lifecycle hook.
--
+
Each lifecycle hook has a single hook-implementing controller, but a hook-implementing controller can manage one or more hooks.
[id="machine-lifecycle-hook-deletion-order_{context}"]
== Machine deletion processing order
In {product-title} {product-version}, there are two lifecycle hooks for the machine deletion phase: `preDrain` and `preTerminate`. When all hooks for a given lifecycle point are removed, reconciliation continues as normal.
.Machine deletion flow
image::310_OpenShift_machine_deletion_hooks_0223.png["The sequence of events in the machine `Deleting` phase."]
The machine `Deleting` phase proceeds in the following order:
. An existing machine is slated for deletion for one of the following reasons:
** A user with `cluster-admin` permissions uses the `oc delete machine` command.
** The machine gets a `machine.openshift.io/delete-machine` annotation.
** The machine set that manages the machine marks it for deletion to reduce the replica count as part of reconciliation.
** The cluster autoscaler identifies a node that is unnecessary to meet the deployment needs of the cluster.
** A machine health check is configured to replace an unhealthy machine.
. The machine enters the `Deleting` phase, in which it is marked for deletion but is still present in the API.
. If a `preDrain` lifecycle hook exists, the hook-implementing controller that manages it performs its specified action.
+
Until all `preDrain` lifecycle hooks are satisfied, the machine status condition `Drainable` is set to `False`.
. There are no unresolved `preDrain` lifecycle hooks and the machine status condition `Drainable` is set to `True`.
. The machine controller attempts to drain the node that is backed by the machine.
** If draining fails, `Drained` is set to `False` and the machine controller attempts to drain the node again.
** If draining succeeds, `Drained` is set to `True`.
. The machine status condition `Drained` is set to `True`.
. If a `preTerminate` lifecycle hook exists, the hook-implementing controller that manages it performs its specified action.
+
Until all `preTerminate` lifecycle hooks are satisfied, the machine status condition `Terminable` is set to `False`.
. There are no unresolved `preTerminate` lifecycle hooks and the machine status condition `Terminable` is set to `True`.
. The machine controller removes the instance from the infrastructure provider.
. The machine controller deletes the `Node` object.
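The processing order above can be approximated with a small state model. This is a minimal sketch for reasoning about the sequence, not the machine controller's code: the `Machine` class, the `reconcile` function, and the returned status strings are all hypothetical, and the sketch assumes that draining always succeeds on the first attempt.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Toy stand-in for a machine in the Deleting phase (not the real API type)."""
    pre_drain: list = field(default_factory=list)      # unresolved preDrain hook names
    pre_terminate: list = field(default_factory=list)  # unresolved preTerminate hook names
    drained: bool = False
    terminated: bool = False

    @property
    def drainable(self) -> bool:
        # Drainable is True only when no preDrain hooks remain.
        return not self.pre_drain

    @property
    def terminable(self) -> bool:
        # Terminable is True only when no preTerminate hooks remain.
        return not self.pre_terminate


def reconcile(machine: Machine) -> str:
    """Advance the toy machine by at most one step and report what happened."""
    if not machine.drainable:
        return "waiting on preDrain hooks"
    if not machine.drained:
        machine.drained = True  # assumption: the drain succeeds immediately
        return "node drained"
    if not machine.terminable:
        return "waiting on preTerminate hooks"
    if not machine.terminated:
        machine.terminated = True
        return "instance removed"
    return "node object deleted"


m = Machine(pre_drain=["MigrateImportantApp"], pre_terminate=["BackupFileSystem"])
print(reconcile(m))
m.pre_drain.clear()      # the hook-implementing controller removes its hook
print(reconcile(m))
```

In the sketch, as in the list above, the machine cannot advance past a lifecycle point until the hook-implementing controllers have removed every hook registered for that point.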

View File

@@ -42,18 +42,6 @@ $ oc get machine -n openshift-machine-api
$ oc annotate machine/<machine_name> -n openshift-machine-api machine.openshift.io/delete-machine="true"
----
. Cordon and drain the node that you want to delete by running the following commands:
+
[source,terminal]
----
$ oc adm cordon <node_name>
----
+
[source,terminal]
----
$ oc adm drain <node_name>
----
. Scale the compute machine set by running one of the following commands:
+
[source,terminal]
@@ -85,6 +73,13 @@ spec:
====
+
You can scale the compute machine set up or down. It takes several minutes for the new machines to be available.
+
[IMPORTANT]
====
By default, the machine controller tries to drain the node that is backed by the machine until it succeeds. In some situations, such as with a misconfigured pod disruption budget, the drain operation might not be able to succeed. If the drain operation fails, the machine controller cannot proceed with removing the machine.
You can skip draining the node by adding the `machine.openshift.io/exclude-node-draining` annotation to a specific machine.
====
.Verification
@@ -93,4 +88,4 @@ You can scale the compute machine set up or down. It takes several minutes for t
[source,terminal]
----
$ oc get machines
----

View File

@@ -10,7 +10,7 @@ This procedure details the steps to replace an etcd member that is unhealthy eit
[NOTE]
====
If your cluster uses a control plane machine set, see "Troubleshooting the control plane machine set" for a more simple etcd recovery procedure.
If your cluster uses a control plane machine set, see "Recovering a degraded etcd Operator" in "Troubleshooting the control plane machine set" for a simpler etcd recovery procedure.
====
.Prerequisites

View File

@@ -23,6 +23,10 @@ As a developer, you can perform the following Operator tasks:
** xref:../operators/user/olm-installing-operators-in-namespace.adoc#olm-installing-operators-in-namespace[Install and subscribe an Operator to your namespace].
** xref:../operators/user/olm-creating-apps-from-installed-operators.adoc#olm-creating-apps-from-installed-operators[Create an application from an installed Operator through the web console].
[role="_additional-resources"]
.Additional resources
* xref:../machine_management/deleting-machine.adoc#machine-lifecycle-hook-deletion-uses_deleting-machine[Machine deletion lifecycle hook examples for Operator developers]
[id="operators-overview-administrator-tasks"]
== For administrators