diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
index 622ccb17b2..bd67caa2da 100644
--- a/_topic_maps/_topic_map.yml
+++ b/_topic_maps/_topic_map.yml
@@ -3539,6 +3539,8 @@ Topics:
   File: graceful-cluster-shutdown
 - Name: Restarting a cluster gracefully
   File: graceful-cluster-restart
+- Name: Hibernating a cluster
+  File: hibernating-cluster
 - Name: OADP Application backup and restore
   Dir: application_backup_and_restore
   Topics:
diff --git a/backup_and_restore/hibernating-cluster.adoc b/backup_and_restore/hibernating-cluster.adoc
new file mode 100644
index 0000000000..73a84eba32
--- /dev/null
+++ b/backup_and_restore/hibernating-cluster.adoc
@@ -0,0 +1,41 @@
+:_mod-docs-content-type: ASSEMBLY
+[id="hibernating-cluster"]
+= Hibernating an {product-title} cluster
+include::_attributes/common-attributes.adoc[]
+:context: hibernating-cluster
+
+toc::[]
+
+You can hibernate your {product-title} cluster for up to 90 days.
+
+// About hibernating a cluster
+include::modules/hibernating-cluster-about.adoc[leveloffset=+1]
+
+[id="hibernating-cluster_prerequisites_{context}"]
+== Prerequisites
+
+* Take an xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[etcd backup] before hibernating the cluster.
++
+[IMPORTANT]
+====
+Taking an etcd backup before hibernating ensures that your cluster can be restored if you encounter any issues when resuming it.
+
+For example, the following conditions can cause the resumed cluster to malfunction:
+
+* etcd data corruption during hibernation
+* Node failure due to hardware issues
+* Network connectivity issues
+
+If your cluster fails to recover, follow the steps to xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].
+====
+
+// Hibernating a cluster
+include::modules/hibernating-cluster-hibernate.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+.Additional resources
+
+* xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backup-etcd[Backing up etcd]
+
+// Resuming a hibernated cluster
+include::modules/hibernating-cluster-resume.adoc[leveloffset=+1]
diff --git a/modules/hibernating-cluster-about.adoc b/modules/hibernating-cluster-about.adoc
new file mode 100644
index 0000000000..7b88dcc63c
--- /dev/null
+++ b/modules/hibernating-cluster-about.adoc
@@ -0,0 +1,20 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/hibernating-cluster.adoc
+
+:_mod-docs-content-type: CONCEPT
+[id="hibernating-cluster-about_{context}"]
+= About cluster hibernation
+
+You can hibernate {product-title} clusters to reduce cloud hosting costs. You can hibernate your {product-title} cluster for up to 90 days and expect it to resume successfully.
+
+You must wait at least 24 hours after cluster installation before hibernating your cluster to allow the first certificate rotation to complete.
+
+[IMPORTANT]
+====
+If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: link:https://www.redhat.com/en/blog/enabling-openshift-4-clusters-to-stop-and-resume-cluster-vms[Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs].
+====
+
+When you hibernate a cluster, you must hibernate all cluster nodes; suspending only some of the nodes is not supported.
+
+After resuming, it can take up to 45 minutes for the cluster to become ready.
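+
+If you are not sure when your cluster was installed, one rough way to confirm that the 24-hour window has passed is to check the creation timestamp of the `kube-system` namespace, which is created during installation. For example:
+
+[source,terminal]
+----
+$ oc get namespace kube-system -o jsonpath='{.metadata.creationTimestamp}'
+----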
diff --git a/modules/hibernating-cluster-hibernate.adoc b/modules/hibernating-cluster-hibernate.adoc
new file mode 100644
index 0000000000..cc1a7eb97f
--- /dev/null
+++ b/modules/hibernating-cluster-hibernate.adoc
@@ -0,0 +1,97 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/hibernating-cluster.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="hibernating-cluster-hibernate_{context}"]
+= Hibernating a cluster
+
+You can hibernate a cluster for up to 90 days. The cluster can recover even if certificates expire while it is in hibernation.
+
+.Prerequisites
+
+* The cluster has been running for at least 24 hours to allow the first certificate rotation to complete.
++
+[IMPORTANT]
+====
+If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: link:https://www.redhat.com/en/blog/enabling-openshift-4-clusters-to-stop-and-resume-cluster-vms[Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs].
+====
+
+* You have taken an etcd backup.
+
+* You have access to the cluster as a user with the `cluster-admin` role.
+
+.Procedure
+
+. Confirm that your cluster has been installed for at least 24 hours.
+
+. Ensure that all nodes are in a good state by running the following command:
++
+[source,terminal]
+----
+$ oc get nodes
+----
++
+.Example output
+[source,terminal]
+----
+NAME                                       STATUS   ROLES                  AGE   VERSION
+ci-ln-812tb4k-72292-8bcj7-master-0         Ready    control-plane,master   32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-master-1         Ready    control-plane,master   32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-master-2         Ready    control-plane,master   32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk   Ready    worker                 19m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv   Ready    worker                 19m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2   Ready    worker                 19m   v1.31.3
+----
++
+All nodes should show `Ready` in the `STATUS` column.
+
+. Ensure that all cluster Operators are in a good state by running the following command:
++
+[source,terminal]
+----
+$ oc get clusteroperators
+----
++
+.Example output
+[source,terminal]
+----
+NAME                       VERSION    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
+authentication             4.18.0-0   True        False         False      51m
+baremetal                  4.18.0-0   True        False         False      72m
+cloud-controller-manager   4.18.0-0   True        False         False      75m
+cloud-credential           4.18.0-0   True        False         False      77m
+cluster-api                4.18.0-0   True        False         False      42m
+cluster-autoscaler         4.18.0-0   True        False         False      72m
+config-operator            4.18.0-0   True        False         False      72m
+console                    4.18.0-0   True        False         False      55m
+...
+----
++
+All cluster Operators should show `AVAILABLE`=`True`, `PROGRESSING`=`False`, and `DEGRADED`=`False`.
+
+. Ensure that all machine config pools are in a good state by running the following command:
++
+[source,terminal]
+----
+$ oc get mcp
+----
++
+.Example output
+[source,terminal]
+----
+NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
+master   rendered-master-87871f187930e67233c837e1d07f49c7   True      False      False      3              3                   3                     0                      96m
+worker   rendered-worker-3c4c459dc5d90017983d7e72928b8aed   True      False      False      3              3                   3                     0                      96m
+----
++
+All machine config pools should show `UPDATING`=`False` and `DEGRADED`=`False`.
+
+. Stop the cluster virtual machines:
++
+Use the tools native to your cluster's cloud environment to shut down the cluster's virtual machines. For one possible approach, see the sketch after this procedure.
++
+[IMPORTANT]
+====
+If you use a bastion virtual machine, do not shut it down.
+====
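+
+For example, on AWS you might stop all instances that carry the cluster's ownership tag. This is a minimal sketch, not the only supported method; it assumes the AWS CLI is configured and that you replace the `<infra_id>` placeholder with your cluster's infrastructure ID:
+
+[source,terminal]
+----
+$ aws ec2 describe-instances \
+    --filters "Name=tag-key,Values=kubernetes.io/cluster/<infra_id>" \
+    --query 'Reservations[].Instances[].InstanceId' \
+    --output text | xargs aws ec2 stop-instances --instance-ids
+----
+
+You can look up the infrastructure ID by running `oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}'`.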
diff --git a/modules/hibernating-cluster-resume.adoc b/modules/hibernating-cluster-resume.adoc
new file mode 100644
index 0000000000..8d490adcc3
--- /dev/null
+++ b/modules/hibernating-cluster-resume.adoc
@@ -0,0 +1,118 @@
+// Module included in the following assemblies:
+//
+// * backup_and_restore/hibernating-cluster.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="hibernating-cluster-resume_{context}"]
+= Resuming a hibernated cluster
+
+When you resume a hibernated cluster within 90 days, you might have to approve certificate signing requests (CSRs) for the nodes to become ready.
+
+It can take around 45 minutes for the cluster to resume, depending on the size of your cluster.
+
+.Prerequisites
+
+* You hibernated your cluster less than 90 days ago.
+* You have access to the cluster as a user with the `cluster-admin` role.
+
+.Procedure
+
+. Within 90 days of cluster hibernation, resume the cluster virtual machines:
++
+Use the tools native to your cluster's cloud environment to start the cluster's virtual machines.
+
+. Wait approximately five minutes; the exact time depends on the number of nodes in your cluster.
+
+. Approve CSRs for the nodes:
+
+.. Check that there is a CSR for each node in the `NotReady` state by running the following command:
++
+[source,terminal]
+----
+$ oc get csr
+----
++
+.Example output
+[source,terminal]
+----
+NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
+csr-4dwsd   37m   kubernetes.io/kube-apiserver-client           system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2                        24h                 Pending
+csr-4vrbr   49m   kubernetes.io/kube-apiserver-client           system:node:ci-ln-812tb4k-72292-8bcj7-master-1                              24h                 Pending
+csr-4wk5x   51m   kubernetes.io/kubelet-serving                 system:node:ci-ln-812tb4k-72292-8bcj7-master-1                              <none>              Pending
+csr-84vb6   51m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
+----
+
+.. Approve each valid CSR by running the following command:
++
+[source,terminal]
+----
+$ oc adm certificate approve <csr_name>
+----
++
+where `<csr_name>` is the name of a pending CSR from the list. Alternatively, you can approve all pending CSRs at once, as shown in the sketch after the node verification step.
+
+.. Verify that all necessary CSRs were approved by running the following command:
++
+[source,terminal]
+----
+$ oc get csr
+----
++
+.Example output
+[source,terminal]
+----
+NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
+csr-4dwsd   37m   kubernetes.io/kube-apiserver-client           system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2                        24h                 Approved,Issued
+csr-4vrbr   49m   kubernetes.io/kube-apiserver-client           system:node:ci-ln-812tb4k-72292-8bcj7-master-1                              24h                 Approved,Issued
+csr-4wk5x   51m   kubernetes.io/kubelet-serving                 system:node:ci-ln-812tb4k-72292-8bcj7-master-1                              <none>              Approved,Issued
+csr-84vb6   51m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved,Issued
+----
++
+CSRs should show `Approved,Issued` in the `CONDITION` column.
+
+. Verify that all nodes now show as ready by running the following command:
++
+[source,terminal]
+----
+$ oc get nodes
+----
++
+.Example output
+[source,terminal]
+----
+NAME                                       STATUS   ROLES                  AGE   VERSION
+ci-ln-812tb4k-72292-8bcj7-master-0         Ready    control-plane,master   32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-master-1         Ready    control-plane,master   32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-master-2         Ready    control-plane,master   32m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk   Ready    worker                 19m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv   Ready    worker                 19m   v1.31.3
+ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2   Ready    worker                 19m   v1.31.3
+----
++
+All nodes should show `Ready` in the `STATUS` column. It might take a few minutes for all nodes to become ready after you approve the CSRs.
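++
+If many CSRs are pending, approving each one individually can be slow. As a convenience, the following sketch approves all pending CSRs in a single pass; review the output of `oc get csr` first to confirm that every pending request is expected:
++
+[source,terminal]
+----
+$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
+----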
+
+. Wait for the cluster Operators to restart and load the new certificates.
++
+This might take 5 to 10 minutes.
+
+. Verify that all cluster Operators are in a good state by running the following command:
++
+[source,terminal]
+----
+$ oc get clusteroperators
+----
++
+.Example output
+[source,terminal]
+----
+NAME                       VERSION    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
+authentication             4.18.0-0   True        False         False      51m
+baremetal                  4.18.0-0   True        False         False      72m
+cloud-controller-manager   4.18.0-0   True        False         False      75m
+cloud-credential           4.18.0-0   True        False         False      77m
+cluster-api                4.18.0-0   True        False         False      42m
+cluster-autoscaler         4.18.0-0   True        False         False      72m
+config-operator            4.18.0-0   True        False         False      72m
+console                    4.18.0-0   True        False         False      55m
+...
+----
++
+All cluster Operators should show `AVAILABLE`=`True`, `PROGRESSING`=`False`, and `DEGRADED`=`False`.
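++
+Rather than polling `oc get clusteroperators` manually, you can optionally block until the cluster settles. This is a minimal sketch that assumes a recent `oc` client, which includes the `oc adm wait-for-stable-cluster` command:
++
+[source,terminal]
+----
+$ oc adm wait-for-stable-cluster --minimum-stable-period=5m
+----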