// Module included in the following assemblies:
//
// * disaster_recovery/scenario-2-restoring-cluster-state.adoc
// * post_installation_configuration/cluster-tasks.adoc
// * etcd/etcd-backup-restore/etcd-disaster-recovery.adoc
// Contributors: The documentation for this section changed drastically for 4.18+.
// Contributors: Some changes for the `etcd` restore procedure are only valid for 4.14+.
// In the 4.14+ documentation, OVN-K requires different steps because there is no centralized OVN
// control plane to be converted. For more information, see PR #64939.
// Do not cherry pick from "main" to "enterprise-4.12" or "enterprise-4.13" because the cherry pick
// procedure is different for these versions. Instead, open a separate PR for 4.13 and
// cherry pick to 4.12 or make the updates directly in 4.12.
:_mod-docs-content-type: PROCEDURE
[id="dr-scenario-2-restoring-cluster-state_{context}"]
= Restoring to a previous cluster state for more than one node

You can use a saved etcd backup to restore a previous cluster state or to restore a cluster that has lost the majority of its control plane hosts.

For high availability (HA) clusters, you must shut down etcd on two hosts of a three-node HA cluster to avoid a cluster split; on four-node and five-node HA clusters, you must shut down three hosts. Quorum requires a simple majority of nodes: two nodes on a three-node HA cluster, and three nodes on four-node and five-node HA clusters. If you start a new cluster from the backup on your recovery host, the other etcd members might still be able to form quorum and continue service.

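As a quick illustration of this quorum arithmetic (a reference sketch, not output from any cluster command):

[source,text]
----
quorum = floor(n / 2) + 1
n = 3  ->  quorum = 2
n = 4  ->  quorum = 3
n = 5  ->  quorum = 3
----
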
[NOTE]
====
If your cluster uses a control plane machine set, see "Recovering a degraded etcd Operator" in "Troubleshooting the control plane machine set" for an etcd recovery procedure. For {product-title} on a single node, see "Restoring to a previous cluster state for a single node".
====
[IMPORTANT]
====
When you restore your cluster, you must use an etcd backup that was taken from the same z-stream release. For example, an {product-title} {product-version}.2 cluster must use an etcd backup that was taken from {product-version}.2.
====
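
One way to confirm the z-stream version that the backup must match, assuming the API server is still reachable from your workstation, is to check the cluster version:

[source,terminal]
----
$ oc get clusterversion
----
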
.Prerequisites
* Access to the cluster as a user with the `cluster-admin` role through a certificate-based `kubeconfig` file, like the one that was used during installation.
* A healthy control plane host to use as the recovery host.
* You have SSH access to control plane hosts.
* A backup directory containing both the `etcd` snapshot and the resources for the static pods, which were from the same backup. The file names in the directory must be in the following formats: `snapshot_<datetimestamp>.db` and `static_kuberesources_<datetimestamp>.tar.gz`.
* Nodes must be accessible or bootable.

[IMPORTANT]
====
For non-recovery control plane nodes, you do not need to establish SSH connectivity or stop the static pods. You can delete and re-create the other non-recovery control plane machines, one by one.
====
.Procedure
. Select a control plane host to use as the recovery host. This is the host that you run the restore operation on.
. Establish SSH connectivity to each of the control plane nodes, including the recovery host.
+
`kube-apiserver` becomes inaccessible after the restore process starts, so you cannot access the control plane nodes. For this reason, it is recommended to establish SSH connectivity to each control plane host in a separate terminal.
+
[IMPORTANT]
====
If you do not complete this step, you will not be able to access the control plane hosts to complete the restore procedure, and you will be unable to recover your cluster from this state.
====
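+
For example, a connection sketch, where the key path and node address are placeholders rather than values defined by this procedure:
+
[source,terminal]
----
$ ssh -i <ssh_key_file> core@<control_plane_node_address>
----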
. Using SSH, connect to each control plane node and run the following command to disable etcd:
+
[source,terminal]
----
$ sudo -E /usr/local/bin/disable-etcd.sh
----
. Copy the etcd backup directory to the recovery control plane host.
+
This procedure assumes that you copied the `backup` directory containing the etcd snapshot and the resources for the static pods to the `/home/core/` directory of your recovery control plane host.
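+
For example, a copy sketch run from the machine that holds the backup, where the local `backup` path and host name are placeholders:
+
[source,terminal]
----
$ scp -r backup core@<recovery_host>:/home/core/
----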
. Use SSH to connect to the recovery host and restore the cluster from a previous backup by running the following command:
+
[source,terminal]
----
$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/<etcd-backup-directory>
----
. Exit the SSH session.
. After the Kubernetes API server responds, turn off the etcd Operator quorum guard by running the following command:
+
[source,terminal]
----
$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
----
. Monitor the recovery progress of the control plane by running the following command:
+
[source,terminal]
----
$ oc adm wait-for-stable-cluster
----
+
[NOTE]
====
It can take up to 15 minutes for the control plane to recover.
====
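+
Optionally, for a more granular view while you wait, you can also watch the individual cluster Operators; this is an additional check, not a required step:
+
[source,terminal]
----
$ oc get clusteroperators
----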
. After the control plane recovers, turn the quorum guard back on by running the following command:
+
[source,terminal]
----
$ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": null}}'
----

.Troubleshooting
If you see no progress rolling out the etcd static pods, you can force a redeployment from the `cluster-etcd-operator` by running the following command:

[source,terminal]
----
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$(date --rfc-3339=ns )"'"}}' --type=merge
----
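
After forcing the redeployment, one way to confirm that the etcd static pods are rolling out again (a follow-up check, assuming the etcd static pods carry the `k8s-app=etcd` label) is to list the pods in the `openshift-etcd` namespace:

[source,terminal]
----
$ oc -n openshift-etcd get pods -l k8s-app=etcd
----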