mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-05 03:47:04 +01:00

OSDOCS#12867: Docs for hibernating a cluster

This commit is contained in:
Andrea Hoffer
2025-01-09 13:07:59 -05:00
committed by openshift-cherrypick-robot
parent e0d6cd8c84
commit d8d13abf06
5 changed files with 278 additions and 0 deletions


@@ -3539,6 +3539,8 @@ Topics:
File: graceful-cluster-shutdown
- Name: Restarting a cluster gracefully
File: graceful-cluster-restart
- Name: Hibernating a cluster
File: hibernating-cluster
- Name: OADP Application backup and restore
Dir: application_backup_and_restore
Topics:


@@ -0,0 +1,41 @@
:_mod-docs-content-type: ASSEMBLY
[id="hibernating-cluster"]
= Hibernating an {product-title} cluster
include::_attributes/common-attributes.adoc[]
:context: hibernating-cluster
toc::[]
You can hibernate your {product-title} cluster for up to 90 days.
// About hibernating a cluster
include::modules/hibernating-cluster-about.adoc[leveloffset=+1]
[id="hibernating-cluster_prerequisites_{context}"]
== Prerequisites
* Take an xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backing-up-etcd-data_backup-etcd[etcd backup] prior to hibernating the cluster.
+
[IMPORTANT]
====
It is important to take an etcd backup before hibernating so that your cluster can be restored if you encounter any issues when resuming the cluster.
For example, the following conditions can cause the resumed cluster to malfunction:

* etcd data corruption during hibernation
* Node failures due to hardware problems
* Network connectivity issues
If your cluster fails to recover, follow the steps to xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[restore to a previous cluster state].
====
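The backup procedure itself is covered in the linked xref. As a reminder of its general shape, the documented backup script runs on a control plane node through a debug pod. The node name below is a placeholder, and this sketch only prints the command rather than executing it:

```shell
# Sketch of the documented etcd backup invocation. Replace the
# placeholder with a real control plane node name before running it
# for real; here the command is only echoed, not executed.
NODE="<control_plane_node>"
echo oc debug node/"$NODE" -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
```

The script writes snapshot and static pod resource files under the given directory on that node.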
// Hibernating a cluster
include::modules/hibernating-cluster-hibernate.adoc[leveloffset=+1]
[role="_additional-resources"]
.Additional resources
* xref:../backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.adoc#backup-etcd[Backing up etcd]
// Resuming a hibernated cluster
include::modules/hibernating-cluster-resume.adoc[leveloffset=+1]


@@ -0,0 +1,20 @@
// Module included in the following assemblies:
//
// * backup_and_restore/hibernating-cluster.adoc
:_mod-docs-content-type: CONCEPT
[id="hibernating-cluster-about_{context}"]
= About cluster hibernation
You can hibernate {product-title} clusters to save money on cloud hosting costs. A cluster that is hibernated for up to 90 days can be expected to resume successfully.
You must wait at least 24 hours after cluster installation before hibernating your cluster to allow for the first certificate rotation.
[IMPORTANT]
====
If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: link:https://www.redhat.com/en/blog/enabling-openshift-4-clusters-to-stop-and-resume-cluster-vms[Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs].
====
When hibernating a cluster, you must hibernate all cluster nodes; suspending only some of the nodes is not supported.
After resuming, it can take up to 45 minutes for the cluster to become ready.


@@ -0,0 +1,97 @@
// Module included in the following assemblies:
//
// * backup_and_restore/hibernating-cluster.adoc
:_mod-docs-content-type: PROCEDURE
[id="hibernating-cluster-hibernate_{context}"]
= Hibernating a cluster
You can hibernate a cluster for up to 90 days. The cluster can recover even if certificates expire while it is in hibernation.
.Prerequisites
* The cluster has been running for at least 24 hours to allow the first certificate rotation to complete.
+
[IMPORTANT]
====
If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: link:https://www.redhat.com/en/blog/enabling-openshift-4-clusters-to-stop-and-resume-cluster-vms[Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs].
====
* You have taken an etcd backup.
* You have access to the cluster as a user with the `cluster-admin` role.
.Procedure
. Confirm that your cluster has been installed for at least 24 hours.
. Ensure that all nodes are in a good state by running the following command:
+
[source,terminal]
----
$ oc get nodes
----
+
.Example output
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ci-ln-812tb4k-72292-8bcj7-master-0 Ready control-plane,master 32m v1.31.3
ci-ln-812tb4k-72292-8bcj7-master-1 Ready control-plane,master 32m v1.31.3
ci-ln-812tb4k-72292-8bcj7-master-2 Ready control-plane,master 32m v1.31.3
ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk   Ready    worker                 19m   v1.31.3
ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv Ready worker 19m v1.31.3
ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2 Ready worker 19m v1.31.3
----
+
All nodes should show `Ready` in the `STATUS` column.
. Ensure that all cluster Operators are in a good state by running the following command:
+
[source,terminal]
----
$ oc get clusteroperators
----
+
.Example output
[source,terminal]
----
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.18.0-0 True False False 51m
baremetal 4.18.0-0 True False False 72m
cloud-controller-manager 4.18.0-0 True False False 75m
cloud-credential 4.18.0-0 True False False 77m
cluster-api 4.18.0-0 True False False 42m
cluster-autoscaler 4.18.0-0 True False False 72m
config-operator 4.18.0-0 True False False 72m
console 4.18.0-0 True False False 55m
...
----
+
All cluster Operators should show `AVAILABLE`=`True`, `PROGRESSING`=`False`, and `DEGRADED`=`False`.
. Ensure that all machine config pools are in a good state by running the following command:
+
[source,terminal]
----
$ oc get mcp
----
+
.Example output
[source,terminal]
----
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-87871f187930e67233c837e1d07f49c7 True False False 3 3 3 0 96m
worker rendered-worker-3c4c459dc5d90017983d7e72928b8aed True False False 3 3 3 0 96m
----
+
All machine config pools should show `UPDATING`=`False` and `DEGRADED`=`False`.
. Stop the cluster virtual machines:
+
Use the tools native to your cluster's cloud environment to shut down the cluster's virtual machines.
+
[IMPORTANT]
====
If you use a bastion virtual machine, do not shut down this virtual machine.
====
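The three health checks above share the same shape: read a listing, flag any row that is not in the expected state. As a convenience, the hypothetical sketch below runs those column filters against captured sample output; it is not part of the official procedure. In a live cluster you would pipe the output of `oc get nodes --no-headers`, `oc get clusteroperators --no-headers`, and `oc get mcp --no-headers` through the same `awk` programs; empty output means the check passed.

```shell
# Hypothetical pre-hibernation health filters. Each awk program prints
# only the names of rows that fail the check, so no output means all good.
# The here-documents stand in for live `oc ... --no-headers` output.

# Nodes: STATUS (column 2) must be Ready.
awk '$2 != "Ready" {print $1}' <<'EOF'
master-0   Ready      control-plane,master   32m   v1.31.3
worker-a   NotReady   worker                 19m   v1.31.3
EOF

# Cluster Operators: AVAILABLE=True, PROGRESSING=False, DEGRADED=False
# (columns 3, 4, and 5).
awk '$3 != "True" || $4 != "False" || $5 != "False" {print $1}' <<'EOF'
authentication   4.18.0-0   True   False   False   51m
console          4.18.0-0   True   True    False   55m
EOF

# Machine config pools: UPDATING=False, DEGRADED=False, and every
# machine ready (columns 4, 5, and 6 versus 7).
awk '$4 != "False" || $5 != "False" || $6 != $7 {print $1}' <<'EOF'
master   rendered-master-87871f   True   False   False   3   3   3   0   96m
worker   rendered-worker-3c4c45   True   False   False   3   2   2   0   96m
EOF
```

With the sample data above, the filters flag `worker-a`, `console`, and the `worker` pool, illustrating what a failed check looks like.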
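The shutdown step itself depends on your cloud provider. As one hypothetical example for AWS, cluster virtual machines are typically tagged with the cluster's infrastructure ID, which you can read with `oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}'`. The sketch below uses an assumed infrastructure ID, keeps the `aws` commands commented out, and only prints what it would do, so it is safe to run as-is:

```shell
# Hypothetical AWS example; other clouds have equivalent commands.
INFRA_ID="mycluster-abc12"  # assumed value; read yours from the cluster
FILTER="Name=tag:kubernetes.io/cluster/${INFRA_ID},Values=owned"

# Real invocation (commented out so this sketch has no side effects):
# ids=$(aws ec2 describe-instances --filters "$FILTER" \
#         --query 'Reservations[].Instances[].InstanceId' --output text)
# aws ec2 stop-instances --instance-ids $ids

echo "Would stop instances matching: $FILTER"
```

Because the filter matches only instances tagged as owned by the cluster, a separately created bastion virtual machine would not be touched. Resuming later follows the same pattern with `aws ec2 start-instances`.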


@@ -0,0 +1,118 @@
// Module included in the following assemblies:
//
// * backup_and_restore/hibernating-cluster.adoc
:_mod-docs-content-type: PROCEDURE
[id="hibernating-cluster-resume_{context}"]
= Resuming a hibernated cluster
When you resume a hibernated cluster within 90 days, you might have to approve certificate signing requests (CSRs) for the nodes to become ready.
It can take around 45 minutes for the cluster to resume, depending on the size of your cluster.
.Prerequisites
* You hibernated your cluster less than 90 days ago.
* You have access to the cluster as a user with the `cluster-admin` role.
.Procedure
. Within 90 days of cluster hibernation, resume the cluster virtual machines:
+
Use the tools native to your cluster's cloud environment to resume the cluster's virtual machines.
. Wait about 5 minutes; the exact time depends on the number of nodes in your cluster.
. Approve CSRs for the nodes:
.. Check that there is a CSR for each node in the `NotReady` state:
+
[source,terminal]
----
$ oc get csr
----
+
.Example output
[source,terminal]
----
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-4dwsd 37m kubernetes.io/kube-apiserver-client system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2 24h Pending
csr-4vrbr 49m kubernetes.io/kube-apiserver-client system:node:ci-ln-812tb4k-72292-8bcj7-master-1 24h Pending
csr-4wk5x 51m kubernetes.io/kubelet-serving system:node:ci-ln-812tb4k-72292-8bcj7-master-1 <none> Pending
csr-84vb6 51m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
----
.. Approve each valid CSR by running the following command:
+
[source,terminal]
----
$ oc adm certificate approve <csr_name>
----
.. Verify that all necessary CSRs were approved by running the following command:
+
[source,terminal]
----
$ oc get csr
----
+
.Example output
[source,terminal]
----
NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION
csr-4dwsd 37m kubernetes.io/kube-apiserver-client system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2 24h Approved,Issued
csr-4vrbr 49m kubernetes.io/kube-apiserver-client system:node:ci-ln-812tb4k-72292-8bcj7-master-1 24h Approved,Issued
csr-4wk5x 51m kubernetes.io/kubelet-serving system:node:ci-ln-812tb4k-72292-8bcj7-master-1 <none> Approved,Issued
csr-84vb6 51m kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Approved,Issued
----
+
CSRs should show `Approved,Issued` in the `CONDITION` column.
. Verify that all nodes now show as ready by running the following command:
+
[source,terminal]
----
$ oc get nodes
----
+
.Example output
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ci-ln-812tb4k-72292-8bcj7-master-0 Ready control-plane,master 32m v1.31.3
ci-ln-812tb4k-72292-8bcj7-master-1 Ready control-plane,master 32m v1.31.3
ci-ln-812tb4k-72292-8bcj7-master-2 Ready control-plane,master 32m v1.31.3
ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk   Ready    worker                 19m   v1.31.3
ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv Ready worker 19m v1.31.3
ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2 Ready worker 19m v1.31.3
----
+
All nodes should show `Ready` in the `STATUS` column. It might take a few minutes for all nodes to become ready after approving the CSRs.
. Wait for cluster Operators to restart to load the new certificates.
+
This might take 5 to 10 minutes.
. Verify that all cluster Operators are in a good state by running the following command:
+
[source,terminal]
----
$ oc get clusteroperators
----
+
.Example output
[source,terminal]
----
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.18.0-0 True False False 51m
baremetal 4.18.0-0 True False False 72m
cloud-controller-manager 4.18.0-0 True False False 75m
cloud-credential 4.18.0-0 True False False 77m
cluster-api 4.18.0-0 True False False 42m
cluster-autoscaler 4.18.0-0 True False False 72m
config-operator 4.18.0-0 True False False 72m
console 4.18.0-0 True False False 55m
...
----
+
All cluster Operators should show `AVAILABLE`=`True`, `PROGRESSING`=`False`, and `DEGRADED`=`False`.
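When many nodes come back at once, approving CSRs one at a time is tedious. The hypothetical sketch below shows the filtering idea on a captured sample of `oc get csr` output; in a live cluster you would pipe `oc get csr --no-headers` through the same `awk` program and feed the names to `xargs oc adm certificate approve`. This is a convenience sketch, not part of the documented procedure, so review the pending requests before approving anything in bulk.

```shell
# Print the names of CSRs whose CONDITION (last column) is Pending.
# The sample rows mirror the example output above; in practice, pipe
# `oc get csr --no-headers` in and send the result to:
#   xargs --no-run-if-empty oc adm certificate approve
awk '$NF == "Pending" {print $1}' <<'EOF'
csr-4dwsd   37m   kubernetes.io/kube-apiserver-client   system:node:worker-c   24h      Pending
csr-4wk5x   51m   kubernetes.io/kubelet-serving         system:node:master-1   <none>   Approved,Issued
EOF
```

With the sample data, only `csr-4dwsd` is printed, since the other request is already approved and issued.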