mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-05 21:46:22 +01:00

Separating command and output for backup and restore book

This commit is contained in:
Andrea Hoffer
2020-08-04 12:07:20 -04:00
committed by openshift-cherrypick-robot
parent 4e3f13d4d0
commit bd985cc1fc
9 changed files with 194 additions and 8 deletions

View File

@@ -31,8 +31,14 @@ You can check whether the proxy is enabled by reviewing the output of `oc get pr
+
Be sure to pass in the `-E` flag to `sudo` so that environment variables are properly passed to the script.
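+
For example, if the cluster-wide proxy is enabled, the proxy variables must already be set in your shell so that `sudo -E` can forward them to the script. The values below are placeholders for illustration only:
+
[source,terminal]
----
$ export HTTP_PROXY=http://<proxy_host>:<proxy_port>
$ export HTTPS_PROXY=http://<proxy_host>:<proxy_port>
$ export NO_PROXY=<no_proxy_domains>
----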
+
[source,terminal]
----
$ sudo -E /usr/local/bin/cluster-backup.sh ./assets/backup
----
+
.Example script output
[source,terminal]
----
1bf371f1b5a483927cd01bb593b0e12cff406eb8d7d0acf4ab079c36a0abd3f7
etcdctl version: 3.3.18
API version: 3.3

View File

@@ -15,12 +15,14 @@ Use the following steps to approve the pending `node-bootstrapper` CSRs.
. Get the list of current CSRs:
+
[source,terminal]
----
$ oc get csr
----
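+
If the list is long, filtering on the `Pending` condition is one way to make the `node-bootstrapper` requests easier to spot (an optional sketch):
+
[source,terminal]
----
$ oc get csr | grep Pending
----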
. Review the details of a CSR to verify that it is valid:
+
[source,terminal]
----
$ oc describe csr <csr_name> <1>
----
@@ -28,6 +30,7 @@ $ oc describe csr <csr_name> <1>
. Approve each valid `node-bootstrapper` CSR:
+
[source,terminal]
----
$ oc adm certificate approve <csr_name>
----
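+
If a large number of CSRs are pending, a combined command such as the following can approve them all at once (a sketch; it assumes `xargs` is available on your workstation):
+
[source,terminal]
----
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
----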

View File

@@ -41,12 +41,14 @@ It is not required to manually stop the Pods on the recovery host. The recovery
.. Move the existing etcd Pod file out of the kubelet manifest directory:
+
[source,terminal]
----
[core@ip-10-0-154-194 ~]$ sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp
----
.. Verify that the etcd Pods are stopped.
+
[source,terminal]
----
[core@ip-10-0-154-194 ~]$ sudo crictl ps | grep etcd
----
@@ -55,12 +57,14 @@ The output of this command should be empty.
.. Move the existing Kubernetes API server Pod file out of the kubelet manifest directory:
+
[source,terminal]
----
[core@ip-10-0-154-194 ~]$ sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp
----
.. Move the etcd data directory to a different location:
+
[source,terminal]
----
[core@ip-10-0-154-194 ~]$ sudo mv /var/lib/etcd/ /tmp
----
@@ -79,8 +83,14 @@ You can check whether the proxy is enabled by reviewing the output of `oc get pr
. Run the restore script on the recovery master host and pass in the path to the etcd backup directory:
+
[source,terminal]
----
[core@ip-10-0-143-125 ~]$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup
----
+
.Example script output
[source,terminal]
----
...stopping kube-scheduler-pod.yaml
...stopping kube-controller-manager-pod.yaml
...stopping etcd-pod.yaml
@@ -111,6 +121,7 @@ static-pod-resources/kube-scheduler-pod-8/kube-scheduler-pod.yaml
.. From the recovery host, run the following command:
+
[source,terminal]
----
[core@ip-10-0-143-125 ~]$ sudo systemctl restart kubelet.service
----
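+
Optionally, confirm that the kubelet came back up after the restart (a quick sanity check):
+
[source,terminal]
----
[core@ip-10-0-143-125 ~]$ sudo systemctl is-active kubelet.service
----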
@@ -121,16 +132,27 @@ static-pod-resources/kube-scheduler-pod-8/kube-scheduler-pod.yaml
.. From the recovery host, verify that the etcd container is running.
+
[source,terminal]
----
[core@ip-10-0-143-125 ~]$ sudo crictl ps | grep etcd
----
+
.Example output
[source,terminal]
----
3ad41b7908e32 36f86e2eeaaffe662df0d21041eb22b8198e0e58abeeae8c743c3e6e977e8009 About a minute ago Running etcd 0 7c05f8af362f0
----
.. From the recovery host, verify that the etcd Pod is running.
+
[source,terminal]
----
[core@ip-10-0-143-125 ~]$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
NAME READY STATUS RESTARTS AGE
etcd-ip-10-0-143-125.ec2.internal 1/1 Running 1 2m47s
----
@@ -141,6 +163,7 @@ If the status is `Pending`, or the output lists more than one running etcd Pod,
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge <1>
----
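+
To confirm that the patch was recorded, you can read the field back (an optional sketch):
+
[source,terminal]
----
$ oc get etcd cluster -o jsonpath='{.spec.forceRedeploymentReason}{"\n"}'
----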
@@ -152,12 +175,14 @@ When the etcd cluster Operator performs a redeployment, the existing nodes are s
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
----
+
Review the `NodeInstallerProgressing` status condition for etcd to verify that all nodes are at the latest revision. The output shows `AllNodesAtLatestRevision` upon successful update:
+
[source,terminal]
----
AllNodesAtLatestRevision
3 nodes are at revision 3
@@ -171,18 +196,21 @@ In a terminal that has access to the cluster as a `cluster-admin` user, run the
.. Update the `kubeapiserver`:
+
[source,terminal]
----
$ oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
Verify that all nodes are updated to the latest revision.
+
[source,terminal]
----
$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
----
+
Review the `NodeInstallerProgressing` status condition to verify that all nodes are at the latest revision. The output shows `AllNodesAtLatestRevision` upon successful update:
+
[source,terminal]
----
AllNodesAtLatestRevision
3 nodes are at revision 3
@@ -190,18 +218,21 @@ AllNodesAtLatestRevision
.. Update the `kubecontrollermanager`:
+
[source,terminal]
----
$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
Verify that all nodes are updated to the latest revision.
+
[source,terminal]
----
$ oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
----
+
Review the `NodeInstallerProgressing` status condition to verify that all nodes are at the latest revision. The output shows `AllNodesAtLatestRevision` upon successful update:
+
[source,terminal]
----
AllNodesAtLatestRevision
3 nodes are at revision 3
@@ -209,18 +240,21 @@ AllNodesAtLatestRevision
.. Update the `kubescheduler`:
+
[source,terminal]
----
$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
Verify that all nodes are updated to the latest revision.
+
[source,terminal]
----
$ oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
----
+
Review the `NodeInstallerProgressing` status condition to verify that all nodes are at the latest revision. The output shows `AllNodesAtLatestRevision` upon successful update:
+
[source,terminal]
----
AllNodesAtLatestRevision
3 nodes are at revision 3
@@ -230,8 +264,14 @@ AllNodesAtLatestRevision
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-143-125.ec2.internal 2/2 Running 0 9h
etcd-ip-10-0-154-194.ec2.internal 2/2 Running 0 9h
etcd-ip-10-0-173-171.ec2.internal 2/2 Running 0 9h

View File

@@ -24,12 +24,14 @@ Wait approximately 10 minutes before continuing to check the status of master no
. Verify that all master nodes are ready.
+
[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/master
----
+
The master nodes are ready if the status is `Ready`, as shown in the following output:
+
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ip-10-0-168-251.ec2.internal Ready master 75m v1.18.3
@@ -41,12 +43,14 @@ ip-10-0-211-16.ec2.internal Ready master 75m v1.18.3
.. Get the list of current CSRs:
+
[source,terminal]
----
$ oc get csr
----
.. Review the details of a CSR to verify that it is valid:
+
[source,terminal]
----
$ oc describe csr <csr_name> <1>
----
@@ -54,18 +58,21 @@ $ oc describe csr <csr_name> <1>
.. Approve each valid CSR:
+
[source,terminal]
----
$ oc adm certificate approve <csr_name>
----
. After the master nodes are ready, verify that all worker nodes are ready.
+
[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/worker
----
+
The worker nodes are ready if the status is `Ready`, as shown in the following output:
+
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ip-10-0-179-95.ec2.internal Ready worker 64m v1.18.3
@@ -77,12 +84,14 @@ ip-10-0-250-100.ec2.internal Ready worker 64m v1.18.3
.. Get the list of current CSRs:
+
[source,terminal]
----
$ oc get csr
----
.. Review the details of a CSR to verify that it is valid:
+
[source,terminal]
----
$ oc describe csr <csr_name> <1>
----
@@ -90,6 +99,7 @@ $ oc describe csr <csr_name> <1>
.. Approve each valid CSR:
+
[source,terminal]
----
$ oc adm certificate approve <csr_name>
----
@@ -98,12 +108,14 @@ $ oc adm certificate approve <csr_name>
.. Check that there are no degraded cluster Operators.
+
[source,terminal]
----
$ oc get clusteroperators
----
+
Check that there are no cluster Operators with the `DEGRADED` condition set to `True`.
+
[source,terminal]
----
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.5.0 True False False 59m
@@ -119,12 +131,14 @@ etcd 4.5.0 True False F
.. Check that all nodes are in the ready state:
+
[source,terminal]
----
$ oc get nodes
----
+
Check that the status for all nodes is `Ready`.
+
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ip-10-0-168-251.ec2.internal Ready master 82m v1.18.3

View File

@@ -20,9 +20,18 @@ It is important to take an etcd backup before performing this procedure so that
.Procedure
. Shut down all of the nodes in the cluster. You can do this from your cloud provider's web console, or you can use the following commands:
.. Obtain the list of nodes:
+
[source,terminal]
----
$ nodes=$(oc get nodes -o name)
----
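+
The variable holds one `node/<node_name>` entry per node. An optional sanity check before shutting anything down (a sketch):
+
[source,terminal]
----
$ echo "${nodes}"
----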
.. Shut down all of the nodes:
+
[source,terminal]
----
$ for node in ${nodes[@]}
do
echo "==== Shut down $node ===="

View File

@@ -26,8 +26,14 @@ If you are aware that the machine is not running or the node is not ready, but y
. Determine if the *machine is not running*:
+
[source,terminal]
----
$ oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}' | grep -v running
----
+
.Example output
[source,terminal]
----
ip-10-0-131-183.ec2.internal stopped <1>
----
<1> This output lists the node and the status of the node's machine. If the status is anything other than `running`, then the *machine is not running*.
@@ -42,16 +48,28 @@ If either of the following scenarios are true, then the *node is not ready*.
** If the machine is running, then check whether the node is unreachable:
+
[source,terminal]
----
$ oc get nodes -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}{" "}' | grep unreachable
----
+
.Example output
[source,terminal]
----
ip-10-0-131-183.ec2.internal node-role.kubernetes.io/master node.kubernetes.io/unreachable node.kubernetes.io/unreachable <1>
----
<1> If the node is listed with an `unreachable` taint, then the *node is not ready*.
** If the node is still reachable, then check whether the node is listed as `NotReady`:
+
[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/master | grep "NotReady"
----
+
.Example output
[source,terminal]
----
ip-10-0-131-183.ec2.internal NotReady master 122m v1.18.3 <1>
----
<1> If the node is listed as `NotReady`, then the *node is not ready*.
@@ -67,8 +85,14 @@ If the machine is running and the node is ready, then check whether the etcd Pod
.. Verify that all master nodes are listed as `Ready`:
+
[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/master
----
+
.Example output
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ip-10-0-131-183.ec2.internal Ready master 6h13m v1.18.3
ip-10-0-164-97.ec2.internal Ready master 6h13m v1.18.3
@@ -77,8 +101,14 @@ ip-10-0-154-204.ec2.internal Ready master 6h13m v1.18.3
.. Check whether the status of an etcd Pod is either `Error` or `CrashLoopBackOff`:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-131-183.ec2.internal 2/3 Error 7 6h9m <1>
etcd-ip-10-0-164-97.ec2.internal 3/3 Running 0 6h6m
etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0 6h6m

View File

@@ -15,12 +15,14 @@ You can identify if your cluster has an unhealthy etcd member.
. Check the status of the `EtcdMembersAvailable` status condition using the following command:
+
[source,terminal]
----
$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="EtcdMembersAvailable")]}{.message}{"\n"}'
----
. Review the output:
+
[source,terminal]
----
2 of 3 members are available, ip-10-0-131-183.ec2.internal is unhealthy
----

View File

@@ -27,6 +27,7 @@ It is important to take an etcd backup before performing this procedure so that
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc debug node/ip-10-0-131-183.ec2.internal <1>
----
@@ -34,18 +35,21 @@ $ oc debug node/ip-10-0-131-183.ec2.internal <1>
.. Change your root directory to the host:
+
[source,terminal]
----
sh-4.2# chroot /host
----
.. Move the existing etcd Pod file out of the kubelet manifest directory:
+
[source,terminal]
----
sh-4.2# mv /etc/kubernetes/manifests/etcd-pod.yaml /var/lib/etcd-backup/
----
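+
After the kubelet removes the static Pod, the etcd containers on this node stop. One way to confirm this from the same shell, mirroring the check used in the restore procedure (a sketch):
+
[source,terminal]
----
sh-4.2# crictl ps | grep etcd
----
+
When the containers have stopped, this command returns no output.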
.. Move the etcd data directory to a different location:
+
[source,terminal]
----
sh-4.2# mv /var/lib/etcd/ /tmp
----
@@ -58,8 +62,14 @@ You can now exit the node shell.
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-131-183.ec2.internal 2/3 Error 7 6h9m
etcd-ip-10-0-164-97.ec2.internal 3/3 Running 0 6h6m
etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0 6h6m
@@ -69,15 +79,21 @@ etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc rsh -n openshift-etcd etcd-ip-10-0-154-204.ec2.internal
----
.. View the member list:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table
----
+
.Example output
[source,terminal]
----
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+------------------------------+---------------------------+---------------------------+
@@ -89,16 +105,27 @@ sh-4.2# etcdctl member list -w table
.. Remove the unhealthy etcd member by providing the ID to the `etcdctl member remove` command:
+
[source,terminal]
----
sh-4.2# etcdctl member remove 62bcf33650a7170a
----
+
.Example output
[source,terminal]
----
Member 62bcf33650a7170a removed from cluster ead669ce1fbfb346
----
.. View the member list again and verify that the member was removed:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table
----
+
.Example output
[source,terminal]
----
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+------------------------------+---------------------------+---------------------------+
@@ -113,6 +140,7 @@ You can now exit the node shell.
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge <1>
----
@@ -126,14 +154,21 @@ When the etcd cluster Operator performs a redeployment, it ensures that all mast
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc rsh -n openshift-etcd etcd-ip-10-0-154-204.ec2.internal
----
.. Verify that all members are healthy:
+
[source,terminal]
----
sh-4.2# etcdctl endpoint health --cluster
----
+
.Example output
[source,terminal]
----
https://10.0.131.183:2379 is healthy: successfully committed proposal: took = 16.671434ms
https://10.0.154.204:2379 is healthy: successfully committed proposal: took = 16.698331ms
https://10.0.164.97:2379 is healthy: successfully committed proposal: took = 16.621645ms

View File

@@ -27,8 +27,14 @@ It is important to take an etcd backup before performing this procedure so that
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-131-183.ec2.internal 3/3 Running 0 123m
etcd-ip-10-0-164-97.ec2.internal 3/3 Running 0 123m
etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0 124m
@@ -38,15 +44,21 @@ etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc rsh -n openshift-etcd etcd-ip-10-0-154-204.ec2.internal
----
.. View the member list:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table
----
+
.Example output
[source,terminal]
----
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+------------------------------+---------------------------+---------------------------+
@@ -58,16 +70,27 @@ sh-4.2# etcdctl member list -w table
.. Remove the unhealthy etcd member by providing the ID to the `etcdctl member remove` command:
+
[source,terminal]
----
sh-4.2# etcdctl member remove 6fc1e7c9db35841d
----
+
.Example output
[source,terminal]
----
Member 6fc1e7c9db35841d removed from cluster baa565c8919b060e
----
.. View the member list again and verify that the member was removed:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table
----
+
.Example output
[source,terminal]
----
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+------------------------------+---------------------------+---------------------------+
@@ -86,9 +109,14 @@ If you are running installer-provisioned infrastructure, or you used the Machine
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide
----
+
.Example output
[source,terminal]
----
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
clustername-8qw5l-master-0 Running m4.xlarge us-east-1 us-east-1a 3h37m ip-10-0-131-183.ec2.internal aws:///us-east-1a/i-0ec2782f8287dfb7e stopped <1>
clustername-8qw5l-master-1 Running m4.xlarge us-east-1 us-east-1b 3h37m ip-10-0-154-204.ec2.internal aws:///us-east-1b/i-096c349b700a19631 running
@@ -101,6 +129,7 @@ clustername-8qw5l-worker-us-east-1c-pkg26 Running m4.large us-east-1 us
.. Save the machine configuration to a file on your file system:
+
[source,terminal]
----
$ oc get machine clustername-8qw5l-master-0 \ <1>
-n openshift-machine-api \
@@ -180,6 +209,7 @@ metadata:
.. Delete the machine of the unhealthy member:
+
[source,terminal]
----
$ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 <1>
----
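+
Deletion can take a few minutes. One way to follow its progress is to watch the machine list until the entry disappears (an optional sketch):
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -w
----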
@@ -187,9 +217,14 @@ $ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 <1>
.. Verify that the machine was deleted:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide
----
+
.Example output
[source,terminal]
----
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
clustername-8qw5l-master-1 Running m4.xlarge us-east-1 us-east-1b 3h37m ip-10-0-154-204.ec2.internal aws:///us-east-1b/i-096c349b700a19631 running
clustername-8qw5l-master-2 Running m4.xlarge us-east-1 us-east-1c 3h37m ip-10-0-164-97.ec2.internal aws:///us-east-1c/i-02626f1dba9ed5bba running
@@ -200,6 +235,7 @@ clustername-8qw5l-worker-us-east-1c-pkg26 Running m4.large us-east-1 us
.. Create the new machine using the `new-master-machine.yaml` file:
+
[source,terminal]
----
$ oc apply -f new-master-machine.yaml
----
@@ -207,9 +243,14 @@ $ oc apply -f new-master-machine.yaml
.. Verify that the new machine has been created:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide
----
+
.Example output
[source,terminal]
----
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
clustername-8qw5l-master-1 Running m4.xlarge us-east-1 us-east-1b 3h37m ip-10-0-154-204.ec2.internal aws:///us-east-1b/i-096c349b700a19631 running
clustername-8qw5l-master-2 Running m4.xlarge us-east-1 us-east-1c 3h37m ip-10-0-164-97.ec2.internal aws:///us-east-1c/i-02626f1dba9ed5bba running
@@ -226,8 +267,14 @@ It might take a few minutes for the new machine to be created. The etcd cluster
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-133-53.ec2.internal 3/3 Running 0 7m49s
etcd-ip-10-0-164-97.ec2.internal 3/3 Running 0 123m
etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0 124m