// Module included in the following assembly:
//
// * hosted_control_planes/hcp_high_availability/hcp-backup-restore-on-premise.adoc

:_mod-docs-content-type: PROCEDURE
[id="hosted-cluster-etcd-backup-restore-on-premise_{context}"]
= Backing up and restoring etcd on a hosted cluster in an on-premise environment

By backing up and restoring etcd on a hosted cluster, you can fix failures, such as corrupted or missing data in an etcd member of a three-node cluster. If multiple members of the etcd cluster encounter data loss or have a `CrashLoopBackOff` status, this approach helps prevent an etcd quorum loss.

:FeatureName: Restoring etcd on a different management cluster for bare metal
include::snippets/technology-preview.adoc[]

.Prerequisites

* The `oc` and `jq` binaries have been installed.

.Procedure

. First, set up your environment variables:
+
.. Set up environment variables for your hosted cluster by entering the following commands, replacing values as necessary:
+
[source,terminal]
----
$ CLUSTER_NAME=my-cluster
----
+
[source,terminal]
----
$ HOSTED_CLUSTER_NAMESPACE=clusters
----
+
[source,terminal]
----
$ CONTROL_PLANE_NAMESPACE="${HOSTED_CLUSTER_NAMESPACE}-${CLUSTER_NAME}"
----
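+
The control plane namespace is the hosted cluster namespace and the cluster name joined with a hyphen. As a minimal sketch, assuming a POSIX shell, the following hypothetical helper `control_plane_namespace` (not part of `oc` or HyperShift) builds that value and rejects empty inputs, which can catch an unset variable before later commands run against the wrong namespace:
+
[source,bash]
----
# Hypothetical helper: print "<hosted_cluster_namespace>-<cluster_name>".
# Returns a non-zero status if either argument is empty.
control_plane_namespace() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: control_plane_namespace <hosted_cluster_namespace> <cluster_name>" >&2
    return 1
  fi
  printf '%s-%s\n' "$1" "$2"
}

# Example values taken from the commands above.
CLUSTER_NAME=my-cluster
HOSTED_CLUSTER_NAMESPACE=clusters
CONTROL_PLANE_NAMESPACE=$(control_plane_namespace "${HOSTED_CLUSTER_NAMESPACE}" "${CLUSTER_NAME}")
echo "${CONTROL_PLANE_NAMESPACE}"
----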
+
.. Pause reconciliation of the hosted cluster by entering the following command, replacing values as necessary:
+
[source,terminal]
----
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
  -p '{"spec":{"pausedUntil":"true"}}' --type=merge
----

. Next, take a snapshot of etcd by using one of the following methods:
+
.. Use a previously backed-up snapshot of etcd.
+
.. If you have an available etcd pod, take a snapshot from the active etcd pod by completing the following steps:
+
... List etcd pods by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
----
+
... Take a snapshot of the pod database and save it inside the pod by entering the following commands:
+
[source,terminal]
----
$ ETCD_POD=etcd-0
----
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- \
  env ETCDCTL_API=3 /usr/bin/etcdctl \
  --cacert /etc/etcd/tls/etcd-ca/ca.crt \
  --cert /etc/etcd/tls/client/etcd-client.crt \
  --key /etc/etcd/tls/client/etcd-client.key \
  --endpoints=https://localhost:2379 \
  snapshot save /var/lib/snapshot.db
----
+
... Verify that the snapshot is successful by entering the following command:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- \
  env ETCDCTL_API=3 /usr/bin/etcdctl -w table snapshot status \
  /var/lib/snapshot.db
----
+
... Make a local copy of the snapshot by entering the following command:
+
[source,terminal]
----
$ oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/snapshot.db \
  /tmp/etcd.snapshot.db
----
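+
Before you continue, it can help to confirm that the local copy exists and is not empty, because the restore steps later in this procedure read `/tmp/etcd.snapshot.db`. The following is a minimal sketch, assuming a POSIX shell. The helper name `check_snapshot_file` is hypothetical, and the check does not validate the etcd data itself:
+
[source,bash]
----
# Hypothetical helper: succeed only if the given path is a regular,
# non-empty file. It does not inspect the snapshot contents.
check_snapshot_file() {
  if [ ! -f "$1" ]; then
    echo "snapshot not found: $1" >&2
    return 1
  fi
  if [ ! -s "$1" ]; then
    echo "snapshot is empty: $1" >&2
    return 1
  fi
  echo "snapshot is present: $1 ($(wc -c < "$1") bytes)"
}

check_snapshot_file /tmp/etcd.snapshot.db ||
  echo "Take or copy the snapshot again before restoring."
----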
+
.. Make a copy of the snapshot database from etcd persistent storage:
+
... List etcd pods by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
----
+
... Find a pod that is running and set its name as the value of `ETCD_POD`, for example `ETCD_POD=etcd-0`, and then copy its snapshot database by entering the following command:
+
[source,terminal]
----
$ oc cp -c etcd \
  ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/data/member/snap/db \
  /tmp/etcd.snapshot.db
----

. Next, scale down the etcd statefulset by entering the following command:
+
[source,terminal]
----
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=0
----
+
.. Delete the volumes for the second and third members by entering the following command:
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pvc/data-etcd-2
----
+
.. Create a pod to access the first etcd member's data:
+
... Get the etcd image by entering the following command:
+
[source,terminal]
----
$ ETCD_IMAGE=$(oc get -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd \
  -o jsonpath='{ .spec.template.spec.containers[0].image }')
----
+
... Create a pod that allows access to etcd data:
+
[source,yaml,subs="attributes+"]
----
$ cat << EOF | oc apply -n ${CONTROL_PLANE_NAMESPACE} -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-data
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd-data
  template:
    metadata:
      labels:
        app: etcd-data
    spec:
      containers:
      - name: access
        image: $ETCD_IMAGE
        volumeMounts:
        - name: data
          mountPath: /var/lib
        command:
        - /usr/bin/bash
        args:
        - -c
        - |-
          while true; do
            sleep 1000
          done
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-etcd-0
EOF
----
+
... Check the status of the `etcd-data` pod and wait for it to be running by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd-data
----
+
... Get the name of the `etcd-data` pod by entering the following command:
+
[source,terminal]
----
$ DATA_POD=$(oc get -n ${CONTROL_PLANE_NAMESPACE} pods --no-headers \
  -l app=etcd-data -o name | cut -d/ -f2)
----
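+
The previous command relies on `-o name` printing each pod as `pod/<name>` and on exactly one pod matching the label. If more than one pod matches, for example while a replaced pod is still terminating, `DATA_POD` would hold several names. As a sketch only, assuming a POSIX shell, the following hypothetical filter `single_resource_name` strips the resource kind and fails unless exactly one name is present:
+
[source,bash]
----
# Hypothetical filter: read `oc get -o name` style lines on stdin and
# print the bare resource name only when exactly one resource is listed.
# The body runs in a subshell, so its variables do not leak.
single_resource_name() (
  names=$(cut -d/ -f2)
  count=$(printf '%s\n' "$names" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly 1 resource, found ${count}" >&2
    exit 1
  fi
  printf '%s\n' "$names"
)

# Example input only; in this procedure the input would come from
# the `oc get ... -o name` command shown above.
printf 'pod/etcd-data-example\n' | single_resource_name
----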
+
.. Copy the etcd snapshot into the pod by entering the following command:
+
[source,terminal]
----
$ oc cp /tmp/etcd.snapshot.db \
  ${CONTROL_PLANE_NAMESPACE}/${DATA_POD}:/var/lib/restored.snap.db
----
+
.. Remove old data from the `etcd-data` pod by entering the following commands:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- rm -rf /var/lib/data
----
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- mkdir -p /var/lib/data
----
+
.. Restore the etcd snapshot by entering the following command:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- \
  etcdutl snapshot restore /var/lib/restored.snap.db \
  --data-dir=/var/lib/data --skip-hash-check \
  --name etcd-0 \
  --initial-cluster-token=etcd-cluster \
  --initial-cluster etcd-0=https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-1=https://etcd-1.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-2=https://etcd-2.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380 \
  --initial-advertise-peer-urls https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380
----
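+
The `--initial-cluster` value in the previous command repeats one URL pattern for each etcd member. As a sketch only, assuming a POSIX shell and the `etcd-<n>.etcd-discovery.<namespace>.svc:2380` peer URL pattern shown in that command, the following hypothetical helper `initial_cluster` builds the comma-separated value from a namespace and a member count, which can reduce copy-and-paste mistakes:
+
[source,bash]
----
# Hypothetical helper: print the --initial-cluster value for members
# etcd-0 .. etcd-(N-1) in the given control plane namespace.
# The body runs in a subshell, so its variables do not leak.
initial_cluster() (
  namespace="$1"
  replicas="${2:-3}"
  members=""
  i=0
  while [ "$i" -lt "$replicas" ]; do
    # Prefix a comma for every member except the first.
    members="${members}${members:+,}etcd-${i}=https://etcd-${i}.etcd-discovery.${namespace}.svc:2380"
    i=$((i + 1))
  done
  printf '%s\n' "$members"
)

# Example namespace only; use your own ${CONTROL_PLANE_NAMESPACE}.
initial_cluster "clusters-my-cluster" 3
----
+
You could then pass `--initial-cluster "$(initial_cluster "${CONTROL_PLANE_NAMESPACE}" 3)"` to the restore command instead of typing the list by hand.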
+
.. Remove the temporary etcd snapshot from the pod by entering the following command:
+
[source,terminal]
----
$ oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- \
  rm /var/lib/restored.snap.db
----
+
.. Delete the data access deployment by entering the following command:
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} deployment/etcd-data
----
+
.. Scale up the etcd cluster by entering the following command:
+
[source,terminal]
----
$ oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=3
----
+
.. Wait for the etcd member pods to return and report as available by entering the following command:
+
[source,terminal]
----
$ oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd -w
----

. Restore reconciliation of the hosted cluster by entering the following command:
+
[source,terminal]
----
$ oc patch -n ${HOSTED_CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
  -p '{"spec":{"pausedUntil":"null"}}' --type=merge
----

. Manually roll out the hosted cluster by entering the following command:
+
[source,terminal]
----
$ oc annotate hostedcluster -n \
  <hosted_cluster_namespace> <hosted_cluster_name> \
  hypershift.openshift.io/restart-date=$(date --iso-8601=seconds)
----
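+
The `--iso-8601` option in the previous command is a GNU `date` option and might not be available on every workstation. As a minimal sketch, assuming that any timestamp string that differs from the previous annotation value is acceptable here, the following hypothetical helper `restart_stamp` produces a UTC timestamp by using only POSIX `date` format specifiers:
+
[source,bash]
----
# Hypothetical helper: print the current UTC time in an ISO 8601 form,
# for example 2026-02-05T11:46:18Z, without GNU-specific options.
restart_stamp() {
  date -u '+%Y-%m-%dT%H:%M:%SZ'
}

restart_stamp
----
+
You could then set the annotation value with `hypershift.openshift.io/restart-date=$(restart_stamp)`.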
+
After this first rollout, the Multus admission controller and network node identity pods do not start yet.

. Delete the pods for the second and third members of etcd and their PVCs by entering the following commands:
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pod/etcd-1 --wait=false
----
+
[source,terminal]
----
$ oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-2 pod/etcd-2 --wait=false
----

. Manually roll out the hosted cluster again by entering the following command:
+
[source,terminal]
----
$ oc annotate hostedcluster -n \
  <hosted_cluster_namespace> <hosted_cluster_name> \
  hypershift.openshift.io/restart-date=$(date --iso-8601=seconds) \
  --overwrite
----
+
After a few minutes, the control plane pods start running.
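+
Rather than rerunning `oc get` by hand while you wait, you can poll with a bounded retry loop. The following is a minimal sketch, assuming a POSIX shell. The helper name `retry_until` is hypothetical, and the attempt count and delay are example values, not recommendations from this procedure:
+
[source,bash]
----
# Hypothetical helper: run a command until it succeeds, up to a maximum
# number of attempts, sleeping between attempts.
# Usage: retry_until <attempts> <delay_seconds> <command> [args...]
# The body runs in a subshell, so its variables do not leak.
retry_until() (
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      exit 0
    fi
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  echo "gave up after ${attempts} attempts: $*" >&2
  exit 1
)

# Stand-in command for illustration; `true` succeeds on the first attempt.
retry_until 3 1 true && echo "condition met"

# With a cluster available, the command could be, for example:
#   retry_until 30 10 oc wait -n "${CONTROL_PLANE_NAMESPACE}" \
#     --for=condition=Ready pod -l app=etcd --timeout=10s
----
+
The retry wrapper is useful around `oc wait` because `oc wait` returns an error immediately when no matching pods exist yet, which can happen shortly after a rollout starts.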