// Module included in the following assembly:
//
// * hosted_control_planes/hcp_high_availability/hcp-backup-restore-on-premise.adoc
:_mod-docs-content-type: PROCEDURE
[id="hosted-cluster-etcd-backup-restore-on-premise_{context}"]
= Backing up and restoring etcd on a hosted cluster in an on-premise environment

By backing up and restoring etcd on a hosted cluster, you can fix failures, such as corrupted or missing data in an etcd member of a three-node cluster. If multiple members of the etcd cluster encounter data loss or have a `CrashLoopBackOff` status, this approach helps prevent an etcd quorum loss.

:FeatureName: Restoring etcd on a different management cluster for bare metal
include::snippets/technology-preview.adoc[]

.Prerequisites

* The `oc` and `jq` binaries have been installed.

.Procedure
. Set up your environment variables:
+
.. Set up environment variables for your hosted cluster by entering the following commands, replacing values as necessary:
+
[source,terminal]
----
$ CLUSTER_NAME=my-cluster
----
+
[source,terminal]
----
$ HOSTED_CLUSTER_NAMESPACE=clusters
----
+
[source,terminal]
----
$ CONTROL_PLANE_NAMESPACE="${HOSTED_CLUSTER_NAMESPACE}-${CLUSTER_NAME}"
----
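+
As a quick sanity check, you can echo the derived control plane namespace. With the example values above (`my-cluster` and `clusters` are placeholders; substitute your own), the command prints `clusters-my-cluster`:
+
[source,terminal]
----
$ echo "${CONTROL_PLANE_NAMESPACE}"
----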
+
.. Pause reconciliation of the hosted cluster by entering the following command, replacing values as necessary:
+
[source,terminal]
----
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
-p '{"spec":{"pausedUntil":"true"}}' --type=merge
----
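+
Optionally, confirm that reconciliation is paused by reading back the `pausedUntil` field. This sketch uses the `jq` binary from the prerequisites and should print `true`:
+
[source,terminal]
----
$ oc get -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} -o json \
| jq -r '.spec.pausedUntil'
----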
. Take a snapshot of etcd by using one of the following methods:
+
.. Use a previously backed-up snapshot of etcd.
+
.. If you have an available etcd pod, take a snapshot from the active etcd pod by completing the following steps:
+
... List etcd pods by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
----
+
... Take a snapshot of the pod database and save it locally to your machine by entering the following commands:
+
[source,terminal]
----
$ ETCD_POD=etcd-0
----
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- \
env ETCDCTL_API=3 /usr/bin/etcdctl \
--cacert /etc/etcd/tls/etcd-ca/ca.crt \
--cert /etc/etcd/tls/client/etcd-client.crt \
--key /etc/etcd/tls/client/etcd-client.key \
--endpoints=https://localhost:2379 \
snapshot save /var/lib/snapshot.db
----
+
... Verify that the snapshot is successful by entering the following command:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- \
env ETCDCTL_API=3 /usr/bin/etcdctl -w table snapshot status \
/var/lib/snapshot.db
----
+
... Make a local copy of the snapshot by entering the following command:
+
[source,terminal]
----
$ oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/snapshot.db \
/tmp/etcd.snapshot.db
----
+
.. Make a copy of the snapshot database from etcd persistent storage:
+
... List etcd pods by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
----
+
... Find a pod that is running and set its name as the value of the `ETCD_POD` variable, for example `ETCD_POD=etcd-0`, and then copy its snapshot database by entering the following command:
+
[source,terminal]
----
$ oc cp -c etcd \
${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/data/member/snap/db \
/tmp/etcd.snapshot.db
----
. Scale down the etcd `StatefulSet` by entering the following command:
+
[source,terminal]
----
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=0
----
+
.. Delete volumes for second and third members by entering the following command:
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pvc/data-etcd-2
----
+
.. Create a pod to access the first etcd member's data:
+
... Get the etcd image by entering the following command:
+
[source,terminal]
----
$ ETCD_IMAGE=$(oc get -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd \
-o jsonpath='{ .spec.template.spec.containers[0].image }')
----
+
... Create a pod that allows access to etcd data:
+
[source,terminal]
----
$ cat << EOF | oc apply -n ${CONTROL_PLANE_NAMESPACE} -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-data
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd-data
  template:
    metadata:
      labels:
        app: etcd-data
    spec:
      containers:
      - name: access
        image: $ETCD_IMAGE
        volumeMounts:
        - name: data
          mountPath: /var/lib
        command:
        - /usr/bin/bash
        args:
        - -c
        - |-
          while true; do
            sleep 1000
          done
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-etcd-0
EOF
----
+
... Check the status of the `etcd-data` pod and wait for it to be running by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd-data
----
+
... Get the name of the `etcd-data` pod by entering the following command:
+
[source,terminal]
----
$ DATA_POD=$(oc get -n ${CONTROL_PLANE_NAMESPACE} pods --no-headers \
-l app=etcd-data -o name | cut -d/ -f2)
----
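+
In the previous command, `-o name` prints the pod as `pod/<pod_name>`, and `cut -d/ -f2` keeps only the name portion. For example, with a hypothetical pod name:
+
[source,terminal]
----
$ echo "pod/etcd-data-5f6b9c7d8-abcde" | cut -d/ -f2
etcd-data-5f6b9c7d8-abcde
----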
+
.. Copy an etcd snapshot into the pod by entering the following command:
+
[source,terminal]
----
$ oc cp /tmp/etcd.snapshot.db \
${CONTROL_PLANE_NAMESPACE}/${DATA_POD}:/var/lib/restored.snap.db
----
+
.. Remove old data from the `etcd-data` pod by entering the following commands:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- rm -rf /var/lib/data
----
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- mkdir -p /var/lib/data
----
+
.. Restore the etcd snapshot by entering the following command:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- \
etcdutl snapshot restore /var/lib/restored.snap.db \
--data-dir=/var/lib/data --skip-hash-check \
--name etcd-0 \
--initial-cluster-token=etcd-cluster \
--initial-cluster etcd-0=https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-1=https://etcd-1.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-2=https://etcd-2.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380 \
--initial-advertise-peer-urls https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380
----
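+
The `--initial-cluster` value lists the peer URL of each member. As a sketch, assuming the default member names `etcd-0` through `etcd-2` and the `etcd-discovery` service shown in the previous command, you can build the same string with a shell loop instead of typing it by hand:
+
[source,terminal]
----
$ INITIAL_CLUSTER=""
$ for i in 0 1 2; do
INITIAL_CLUSTER="${INITIAL_CLUSTER:+${INITIAL_CLUSTER},}etcd-${i}=https://etcd-${i}.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380"
done
$ echo "${INITIAL_CLUSTER}"
----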
+
.. Remove the temporary etcd snapshot from the pod by entering the following command:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- \
rm /var/lib/restored.snap.db
----
+
.. Delete the data access deployment by entering the following command:
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} deployment/etcd-data
----
+
.. Scale up the etcd cluster by entering the following command:
+
[source,terminal]
----
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=3
----
+
.. Wait for the etcd member pods to return and report as available by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd -w
----
. Restore reconciliation of the hosted cluster by entering the following command:
+
[source,terminal]
----
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
-p '{"spec":{"pausedUntil":null}}' --type=merge
----
. Manually roll out the hosted cluster by entering the following command:
+
[source,terminal]
----
$ oc annotate -n ${HOSTED_CLUSTER_NAMESPACE} hostedcluster ${CLUSTER_NAME} \
hypershift.openshift.io/restart-date=$(date --iso-8601=seconds)
----
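+
The annotation value is an ISO 8601 timestamp; writing a new value is what triggers the rollout. You can preview the value by running the `date` command on its own:
+
[source,terminal]
----
$ date --iso-8601=seconds
----
+
The output is a timestamp such as `2026-01-14T08:52:23+00:00`.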
+
At this point, the Multus admission controller and network node identity pods do not start yet.
. Delete the pods for the second and third members of etcd and their PVCs by entering the following commands:
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pod/etcd-1 --wait=false
----
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-2 pod/etcd-2 --wait=false
----
. Manually roll out the hosted cluster again by entering the following command:
+
[source,terminal]
----
$ oc annotate -n ${HOSTED_CLUSTER_NAMESPACE} hostedcluster ${CLUSTER_NAME} \
hypershift.openshift.io/restart-date=$(date --iso-8601=seconds) \
--overwrite
----
+
After a few minutes, the control plane pods start running.