Separating command and output for backup and restore book

committed by openshift-cherrypick-robot
parent 4e3f13d4d0
commit bd985cc1fc

@@ -31,8 +31,14 @@ You can check whether the proxy is enabled by reviewing the output of `oc get pr
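
The hunk context above truncates the proxy check. For reference, the cluster-wide proxy configuration can be reviewed with a command along these lines (a sketch, not part of this diff); the proxy is enabled if the `httpProxy`, `httpsProxy`, and `noProxy` fields have values set:

[source,terminal]
----
$ oc get proxy cluster -o yaml
----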
+
Be sure to pass in the `-E` flag to `sudo` so that environment variables are properly passed to the script.
+
[source,terminal]
----
$ sudo -E /usr/local/bin/cluster-backup.sh ./assets/backup
----
+
.Example script output
[source,terminal]
----
1bf371f1b5a483927cd01bb593b0e12cff406eb8d7d0acf4ab079c36a0abd3f7
etcdctl version: 3.3.18
API version: 3.3

@@ -15,12 +15,14 @@ Use the following steps to approve the pending `node-bootstrapper` CSRs.

. Get the list of current CSRs:
+
[source,terminal]
----
$ oc get csr
----

. Review the details of a CSR to verify that it is valid:
+
[source,terminal]
----
$ oc describe csr <csr_name> <1>
----

@@ -28,6 +30,7 @@ $ oc describe csr <csr_name> <1>

. Approve each valid `node-bootstrapper` CSR:
+
[source,terminal]
----
$ oc adm certificate approve <csr_name>
----
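
If there are many pending CSRs, they can also be approved in bulk instead of one at a time; a sketch, assuming every pending CSR should be approved:

[source,terminal]
----
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
----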
@@ -41,12 +41,14 @@ It is not required to manually stop the Pods on the recovery host. The recovery

.. Move the existing etcd Pod file out of the kubelet manifest directory:
+
[source,terminal]
----
[core@ip-10-0-154-194 ~]$ sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp
----

.. Verify that the etcd Pods are stopped.
+
[source,terminal]
----
[core@ip-10-0-154-194 ~]$ sudo crictl ps | grep etcd
----

@@ -55,12 +57,14 @@ The output of this command should be empty.

.. Move the existing Kubernetes API server Pod file out of the kubelet manifest directory:
+
[source,terminal]
----
[core@ip-10-0-154-194 ~]$ sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp
----

.. Move the etcd data directory to a different location:
+
[source,terminal]
----
[core@ip-10-0-154-194 ~]$ sudo mv /var/lib/etcd/ /tmp
----

@@ -79,8 +83,14 @@ You can check whether the proxy is enabled by reviewing the output of `oc get pr

. Run the restore script on the recovery master host and pass in the path to the etcd backup directory:
+
[source,terminal]
----
[core@ip-10-0-143-125 ~]$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup
----
+
.Example script output
[source,terminal]
----
...stopping kube-scheduler-pod.yaml
...stopping kube-controller-manager-pod.yaml
...stopping etcd-pod.yaml

@@ -111,6 +121,7 @@ static-pod-resources/kube-scheduler-pod-8/kube-scheduler-pod.yaml

.. From the recovery host, run the following command:
+
[source,terminal]
----
[core@ip-10-0-143-125 ~]$ sudo systemctl restart kubelet.service
----

@@ -121,16 +132,27 @@ static-pod-resources/kube-scheduler-pod-8/kube-scheduler-pod.yaml

.. From the recovery host, verify that the etcd container is running.
+
[source,terminal]
----
[core@ip-10-0-143-125 ~]$ sudo crictl ps | grep etcd
----
+
.Example output
[source,terminal]
----
3ad41b7908e32 36f86e2eeaaffe662df0d21041eb22b8198e0e58abeeae8c743c3e6e977e8009 About a minute ago Running etcd 0 7c05f8af362f0
----

.. From the recovery host, verify that the etcd Pod is running.
+
[source,terminal]
----
[core@ip-10-0-143-125 ~]$ oc get pods -n openshift-etcd | grep etcd

----
+
.Example output
[source,terminal]
----
NAME READY STATUS RESTARTS AGE
etcd-ip-10-0-143-125.ec2.internal 1/1 Running 1 2m47s
----

@@ -141,6 +163,7 @@ If the status is `Pending`, or the output lists more than one running etcd Pod,
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge <1>
----

@@ -152,12 +175,14 @@ When the etcd cluster Operator performs a redeployment, the existing nodes are s
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
----
+
Review the `NodeInstallerProgressing` status condition for etcd to verify that all nodes are at the latest revision. The output shows `AllNodesAtLatestRevision` upon successful update:
+
[source,terminal]
----
AllNodesAtLatestRevision
3 nodes are at revision 3

@@ -171,18 +196,21 @@ In a terminal that has access to the cluster as a `cluster-admin` user, run the

.. Update the `kubeapiserver`:
+
[source,terminal]
----
$ oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
Verify all nodes are updated to the latest revision.
+
[source,terminal]
----
$ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
----
+
Review the `NodeInstallerProgressing` status condition to verify that all nodes are at the latest revision. The output shows `AllNodesAtLatestRevision` upon successful update:
+
[source,terminal]
----
AllNodesAtLatestRevision
3 nodes are at revision 3

@@ -190,18 +218,21 @@ AllNodesAtLatestRevision

.. Update the `kubecontrollermanager`:
+
[source,terminal]
----
$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
Verify all nodes are updated to the latest revision.
+
[source,terminal]
----
$ oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
----
+
Review the `NodeInstallerProgressing` status condition to verify that all nodes are at the latest revision. The output shows `AllNodesAtLatestRevision` upon successful update:
+
[source,terminal]
----
AllNodesAtLatestRevision
3 nodes are at revision 3

@@ -209,18 +240,21 @@ AllNodesAtLatestRevision

.. Update the `kubescheduler`:
+
[source,terminal]
----
$ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
----
+
Verify all nodes are updated to the latest revision.
+
[source,terminal]
----
$ oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
----
+
Review the `NodeInstallerProgressing` status condition to verify that all nodes are at the latest revision. The output shows `AllNodesAtLatestRevision` upon successful update:
+
[source,terminal]
----
AllNodesAtLatestRevision
3 nodes are at revision 3

@@ -230,8 +264,14 @@ AllNodesAtLatestRevision
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-143-125.ec2.internal 2/2 Running 0 9h
etcd-ip-10-0-154-194.ec2.internal 2/2 Running 0 9h
etcd-ip-10-0-173-171.ec2.internal 2/2 Running 0 9h

@@ -24,12 +24,14 @@ Wait approximately 10 minutes before continuing to check the status of master no

. Verify that all master nodes are ready.
+
[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/master
----
+
The master nodes are ready if the status is `Ready`, as shown in the following output:
+
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ip-10-0-168-251.ec2.internal Ready master 75m v1.18.3

@@ -41,12 +43,14 @@ ip-10-0-211-16.ec2.internal Ready master 75m v1.18.3

.. Get the list of current CSRs:
+
[source,terminal]
----
$ oc get csr
----

.. Review the details of a CSR to verify that it is valid:
+
[source,terminal]
----
$ oc describe csr <csr_name> <1>
----

@@ -54,18 +58,21 @@ $ oc describe csr <csr_name> <1>

.. Approve each valid CSR:
+
[source,terminal]
----
$ oc adm certificate approve <csr_name>
----

. After the master nodes are ready, verify that all worker nodes are ready.
+
[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/worker
----
+
The worker nodes are ready if the status is `Ready`, as shown in the following output:
+
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ip-10-0-179-95.ec2.internal Ready worker 64m v1.18.3

@@ -77,12 +84,14 @@ ip-10-0-250-100.ec2.internal Ready worker 64m v1.18.3

.. Get the list of current CSRs:
+
[source,terminal]
----
$ oc get csr
----

.. Review the details of a CSR to verify that it is valid:
+
[source,terminal]
----
$ oc describe csr <csr_name> <1>
----

@@ -90,6 +99,7 @@ $ oc describe csr <csr_name> <1>

.. Approve each valid CSR:
+
[source,terminal]
----
$ oc adm certificate approve <csr_name>
----

@@ -98,12 +108,14 @@ $ oc adm certificate approve <csr_name>

.. Check that there are no degraded cluster Operators.
+
[source,terminal]
----
$ oc get clusteroperators
----
+
Check that there are no cluster Operators with the `DEGRADED` condition set to `True`.
+
[source,terminal]
----
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.5.0 True False False 59m

@@ -119,12 +131,14 @@ etcd 4.5.0 True False F

.. Check that all nodes are in the ready state:
+
[source,terminal]
----
$ oc get nodes
----
+
Check that the status for all nodes is `Ready`.
+
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ip-10-0-168-251.ec2.internal Ready master 82m v1.18.3

@@ -20,9 +20,18 @@ It is important to take an etcd backup before performing this procedure so that
.Procedure

. Shut down all of the nodes in the cluster. You can do this from your cloud provider's web console, or you can use the below commands:

.. Obtain the list of nodes:
+
[source,terminal]
----
$ nodes=$(oc get nodes -o name)
----

.. Shut down all of the nodes:
+
[source,terminal]
----
$ for node in ${nodes[@]}
do
echo "==== Shut down $node ===="
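# The diff hunk is truncated at this point. A minimal sketch of how the loop
# is typically completed (an assumption, not shown in this hunk):
#   oc debug ${node} -- chroot /host shutdown -h 1
# done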
@@ -26,8 +26,14 @@ If you are aware that the machine is not running or the node is not ready, but y

. Determine if the *machine is not running*:
+
[source,terminal]
----
$ oc get machines -A -ojsonpath='{range .items[*]}{@.status.nodeRef.name}{"\t"}{@.status.providerStatus.instanceState}{"\n"}' | grep -v running
----
+
.Example output
[source,terminal]
----
ip-10-0-131-183.ec2.internal stopped <1>
----
<1> This output lists the node and the status of the node's machine. If the status is anything other than `running`, then the *machine is not running*.
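
To check one specific machine rather than filtering the whole list, the same `status.providerStatus.instanceState` field can be queried directly; a sketch (not part of this diff), where `<machine_name>` is a placeholder:

[source,terminal]
----
$ oc get machine <machine_name> -n openshift-machine-api -o jsonpath='{.status.providerStatus.instanceState}'
----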
@@ -42,16 +48,28 @@ If either of the following scenarios are true, then the *node is not ready*.

** If the machine is running, then check whether the node is unreachable:
+
[source,terminal]
----
$ oc get nodes -o jsonpath='{range .items[*]}{"\n"}{.metadata.name}{"\t"}{range .spec.taints[*]}{.key}{" "}' | grep unreachable
----
+
.Example output
[source,terminal]
----
ip-10-0-131-183.ec2.internal node-role.kubernetes.io/master node.kubernetes.io/unreachable node.kubernetes.io/unreachable <1>
----
<1> If the node is listed with an `unreachable` taint, then the *node is not ready*.

** If the node is still reachable, then check whether the node is listed as `NotReady`:
+
[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/master | grep "NotReady"
----
+
.Example output
[source,terminal]
----
ip-10-0-131-183.ec2.internal NotReady master 122m v1.18.3 <1>
----
<1> If the node is listed as `NotReady`, then the *node is not ready*.

@@ -67,8 +85,14 @@ If the machine is running and the node is ready, then check whether the etcd Pod

.. Verify that all master nodes are listed as `Ready`:
+
[source,terminal]
----
$ oc get nodes -l node-role.kubernetes.io/master
----
+
.Example output
[source,terminal]
----
NAME STATUS ROLES AGE VERSION
ip-10-0-131-183.ec2.internal Ready master 6h13m v1.18.3
ip-10-0-164-97.ec2.internal Ready master 6h13m v1.18.3

@@ -77,8 +101,14 @@ ip-10-0-154-204.ec2.internal Ready master 6h13m v1.18.3

.. Check whether the status of an etcd Pod is either `Error` or `CrashloopBackoff`:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-131-183.ec2.internal 2/3 Error 7 6h9m <1>
etcd-ip-10-0-164-97.ec2.internal 3/3 Running 0 6h6m
etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0 6h6m

@@ -15,12 +15,14 @@ You can identify if your cluster has an unhealthy etcd member.

. Check the status of the `EtcdMembersAvailable` status condition using the following command:
+
[source,terminal]
----
$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="EtcdMembersAvailable")]}{.message}{"\n"}'
----

. Review the output:
+
[source,terminal]
----
2 of 3 members are available, ip-10-0-131-183.ec2.internal is unhealthy
----

@@ -27,6 +27,7 @@ It is important to take an etcd backup before performing this procedure so that
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc debug node/ip-10-0-131-183.ec2.internal <1>
----

@@ -34,18 +35,21 @@ $ oc debug node/ip-10-0-131-183.ec2.internal <1>

.. Change your root directory to the host:
+
[source,terminal]
----
sh-4.2# chroot /host
----

.. Move the existing etcd Pod file out of the kubelet manifest directory:
+
[source,terminal]
----
sh-4.2# mv /etc/kubernetes/manifests/etcd-pod.yaml /var/lib/etcd-backup/
----

.. Move the etcd data directory to a different location:
+
[source,terminal]
----
sh-4.2# mv /var/lib/etcd/ /tmp
----

@@ -58,8 +62,14 @@ You can now exit the node shell.
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-131-183.ec2.internal 2/3 Error 7 6h9m
etcd-ip-10-0-164-97.ec2.internal 3/3 Running 0 6h6m
etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0 6h6m

@@ -69,15 +79,21 @@ etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc rsh -n openshift-etcd etcd-ip-10-0-154-204.ec2.internal
----

.. View the member list:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table

----
+
.Example output
[source,terminal]
----
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+------------------------------+---------------------------+---------------------------+

@@ -89,16 +105,27 @@ sh-4.2# etcdctl member list -w table

.. Remove the unhealthy etcd member by providing the ID to the `etcdctl member remove` command:
+
[source,terminal]
----
sh-4.2# etcdctl member remove 62bcf33650a7170a
----
+
.Example output
[source,terminal]
----
Member 62bcf33650a7170a removed from cluster ead669ce1fbfb346
----

.. View the member list again and verify that the member was removed:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table

----
+
.Example output
[source,terminal]
----
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+------------------------------+---------------------------+---------------------------+

@@ -113,6 +140,7 @@ You can now exit the node shell.
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "single-master-recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge <1>
----

@@ -126,14 +154,21 @@ When the etcd cluster Operator performs a redeployment, it ensures that all mast
+
In a terminal that has access to the cluster as a cluster-admin user, run the following command:
+
[source,terminal]
----
$ oc rsh -n openshift-etcd etcd-ip-10-0-154-204.ec2.internal
----

.. Verify that all members are healthy:
+
[source,terminal]
----
sh-4.2# etcdctl endpoint health --cluster
----
+
.Example output
[source,terminal]
----
https://10.0.131.183:2379 is healthy: successfully committed proposal: took = 16.671434ms
https://10.0.154.204:2379 is healthy: successfully committed proposal: took = 16.698331ms
https://10.0.164.97:2379 is healthy: successfully committed proposal: took = 16.621645ms

@@ -27,8 +27,14 @@ It is important to take an etcd backup before performing this procedure so that
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-131-183.ec2.internal 3/3 Running 0 123m
etcd-ip-10-0-164-97.ec2.internal 3/3 Running 0 123m
etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0 124m

@@ -38,15 +44,21 @@ etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc rsh -n openshift-etcd etcd-ip-10-0-154-204.ec2.internal
----

.. View the member list:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table

----
+
.Example output
[source,terminal]
----
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+------------------------------+---------------------------+---------------------------+

@@ -58,16 +70,27 @@ sh-4.2# etcdctl member list -w table

.. Remove the unhealthy etcd member by providing the ID to the `etcdctl member remove` command:
+
[source,terminal]
----
sh-4.2# etcdctl member remove 6fc1e7c9db35841d
----
+
.Example output
[source,terminal]
----
Member 6fc1e7c9db35841d removed from cluster baa565c8919b060e
----

.. View the member list again and verify that the member was removed:
+
[source,terminal]
----
sh-4.2# etcdctl member list -w table

----
+
.Example output
[source,terminal]
----
+------------------+---------+------------------------------+---------------------------+---------------------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS |
+------------------+---------+------------------------------+---------------------------+---------------------------+

@@ -86,9 +109,14 @@ If you are running installer-provisioned infrastructure, or you used the Machine
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide

----
+
.Example output
[source,terminal]
----
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
clustername-8qw5l-master-0 Running m4.xlarge us-east-1 us-east-1a 3h37m ip-10-0-131-183.ec2.internal aws:///us-east-1a/i-0ec2782f8287dfb7e stopped <1>
clustername-8qw5l-master-1 Running m4.xlarge us-east-1 us-east-1b 3h37m ip-10-0-154-204.ec2.internal aws:///us-east-1b/i-096c349b700a19631 running

@@ -101,6 +129,7 @@ clustername-8qw5l-worker-us-east-1c-pkg26 Running m4.large us-east-1 us

.. Save the machine configuration to a file on your file system:
+
[source,terminal]
----
$ oc get machine clustername-8qw5l-master-0 \ <1>
-n openshift-machine-api \
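# The hunk is truncated at this point. Based on the new-master-machine.yaml
# file that is applied later in this diff, the command presumably continues
# along these lines (an assumption):
#   -o yaml > new-master-machine.yaml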
@@ -180,6 +209,7 @@ metadata:

.. Delete the machine of the unhealthy member:
+
[source,terminal]
----
$ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 <1>
----

@@ -187,9 +217,14 @@ $ oc delete machine -n openshift-machine-api clustername-8qw5l-master-0 <1>

.. Verify that the machine was deleted:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide

----
+
.Example output
[source,terminal]
----
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
clustername-8qw5l-master-1 Running m4.xlarge us-east-1 us-east-1b 3h37m ip-10-0-154-204.ec2.internal aws:///us-east-1b/i-096c349b700a19631 running
clustername-8qw5l-master-2 Running m4.xlarge us-east-1 us-east-1c 3h37m ip-10-0-164-97.ec2.internal aws:///us-east-1c/i-02626f1dba9ed5bba running

@@ -200,6 +235,7 @@ clustername-8qw5l-worker-us-east-1c-pkg26 Running m4.large us-east-1 us

.. Create the new machine using the `new-master-machine.yaml` file:
+
[source,terminal]
----
$ oc apply -f new-master-machine.yaml
----

@@ -207,9 +243,14 @@ $ oc apply -f new-master-machine.yaml

.. Verify that the new machine has been created:
+
[source,terminal]
----
$ oc get machines -n openshift-machine-api -o wide

----
+
.Example output
[source,terminal]
----
NAME PHASE TYPE REGION ZONE AGE NODE PROVIDERID STATE
clustername-8qw5l-master-1 Running m4.xlarge us-east-1 us-east-1b 3h37m ip-10-0-154-204.ec2.internal aws:///us-east-1b/i-096c349b700a19631 running
clustername-8qw5l-master-2 Running m4.xlarge us-east-1 us-east-1c 3h37m ip-10-0-164-97.ec2.internal aws:///us-east-1c/i-02626f1dba9ed5bba running

@@ -226,8 +267,14 @@ It might take a few minutes for the new machine to be created. The etcd cluster
+
In a terminal that has access to the cluster as a `cluster-admin` user, run the following command:
+
[source,terminal]
----
$ oc get pods -n openshift-etcd | grep etcd
----
+
.Example output
[source,terminal]
----
etcd-ip-10-0-133-53.ec2.internal 3/3 Running 0 7m49s
etcd-ip-10-0-164-97.ec2.internal 3/3 Running 0 123m
etcd-ip-10-0-154-204.ec2.internal 3/3 Running 0 124m