:_mod-docs-content-type: PROCEDURE
[id="installation-replacing-control-plane-nodes_{context}"]
= Replacing control plane nodes in a two-node OpenShift cluster with fencing

You can replace a failed control plane node in a two-node OpenShift cluster. The replacement node must use the same host name and IP address as the failed node.

.Prerequisites

* You have a functioning survivor control plane node.
* You have verified that either the machine is not running or the node is not ready.
* You have access to the cluster as a user with the `cluster-admin` role.
* You know the host name and IP address of the failed node.

[NOTE]
====
Back up etcd before you proceed so that you can restore the cluster if any issues occur.
====
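
For example, you can create a backup from a debug shell on the surviving control plane node by running the cluster backup script; the backup directory shown here is only an illustration:

[source,terminal]
----
$ oc debug node/<survivor_node_name> -- chroot /host /usr/local/bin/cluster-backup.sh /home/core/assets/backup
----
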
.Procedure

. Check the quorum state by running the following command:
+
[source,terminal]
----
$ sudo pcs quorum status
----
+
.Example output
[source,terminal]
----
Quorum information
------------------
Date:             Fri Oct 3 14:15:31 2025
Quorum provider:  corosync_votequorum
Nodes:            2
Node ID:          1
Ring ID:          1.16
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      2
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
         1          1         NR master-0 (local)
         2          1         NR master-1
----

.. If quorum is lost and one control plane node is still running, restore quorum manually on the survivor node by running the following command:
+
[source,terminal]
----
$ sudo pcs quorum unblock
----

.. If only one node failed, verify that etcd is running on the survivor node by running the following command:
+
[source,terminal]
----
$ sudo pcs resource status etcd
----

.. If etcd is not running, restart etcd by running the following command:
+
[source,terminal]
----
$ sudo pcs resource cleanup etcd
----
+
If etcd still does not start, force it to start manually on the survivor node, skipping fencing:
+
[IMPORTANT]
====
Before running these commands, ensure that the node being replaced is inaccessible. Otherwise, you risk etcd corruption.
====
+
[source,terminal]
----
$ sudo pcs resource debug-stop etcd
----
+
[source,terminal]
----
$ sudo OCF_RESKEY_CRM_meta_notify_start_resource='etcd' pcs resource debug-start etcd
----
+
After recovery, etcd must be running successfully on the survivor node.
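+
To confirm that etcd recovered, you can check the resource state again on the survivor node, for example:
+
[source,terminal]
----
$ sudo pcs resource status etcd
----
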
. Delete etcd secrets for the failed node by running the following commands:
+
[source,terminal]
----
$ oc project openshift-etcd
----
+
[source,terminal]
----
$ oc delete secret etcd-peer-<node_name>
----
+
[source,terminal]
----
$ oc delete secret etcd-serving-<node_name>
----
+
[source,terminal]
----
$ oc delete secret etcd-serving-metrics-<node_name>
----
+
[NOTE]
====
To replace the failed node, you must delete its etcd secrets first. When etcd is running, it might take some time for the API server to respond to these commands.
====
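+
If you are unsure of the exact secret names, you can list the secrets that reference the failed node first; `<node_name>` is the host name of the failed node:
+
[source,terminal]
----
$ oc get secrets -n openshift-etcd | grep <node_name>
----
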
. Delete resources for the failed node:

.. If you have the `BareMetalHost` (BMH) objects, list them to identify the host you are replacing by running the following command:
+
[source,terminal]
----
$ oc get bmh -n openshift-machine-api
----

.. Delete the BMH object for the failed node by running the following command:
+
[source,terminal]
----
$ oc delete bmh/<bmh_name> -n openshift-machine-api
----

.. List the `Machine` objects to identify the object that maps to the node that you are replacing by running the following command:
+
[source,terminal]
----
$ oc get machines.machine.openshift.io -n openshift-machine-api
----

.. Get the label with the machine hash value from the `Machine` object by running the following command:
+
[source,terminal]
----
$ oc get machines.machine.openshift.io/<machine_name> -n openshift-machine-api \
-o jsonpath='Machine hash label: {.metadata.labels.machine\.openshift\.io/cluster-api-cluster}{"\n"}'
----
+
Replace `<machine_name>` with the name of a `Machine` object in your cluster. For example, `ostest-bfs7w-ctrlplane-0`.
+
You need this label to provision a new `Machine` object.
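+
As an alternative to the `jsonpath` query, you can display all labels on the `Machine` object and note the `machine.openshift.io/cluster-api-cluster` value:
+
[source,terminal]
----
$ oc get machines.machine.openshift.io/<machine_name> -n openshift-machine-api --show-labels
----
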
.. Delete the `Machine` object for the failed node by running the following command:
+
[source,terminal]
----
$ oc delete machines.machine.openshift.io/<machine_name>-<failed_node_name> -n openshift-machine-api
----
+
[NOTE]
====
The node object is deleted automatically after you delete the `Machine` object.
====
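+
Optionally, you can confirm that the node object for the failed node is gone by listing the nodes:
+
[source,terminal]
----
$ oc get nodes
----
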
. Recreate the failed host by using the same name and IP address:
+
[IMPORTANT]
====
You must perform this step only if you used installer-provisioned infrastructure or the Machine API to create the original node.
For information about replacing a failed bare-metal control plane node, see "Replacing an unhealthy etcd member on bare metal".
====

.. Remove the BMH and `Machine` objects. The machine controller automatically deletes the node object.

.. Provision a new machine by using the following sample configuration:
+
.Example `Machine` object configuration
[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  annotations:
    metal3.io/BareMetalHost: openshift-machine-api/{bmh_name}
  finalizers:
  - machine.machine.openshift.io
  labels:
    machine.openshift.io/cluster-api-cluster: {machine_hash_label}
    machine.openshift.io/cluster-api-machine-role: master
    machine.openshift.io/cluster-api-machine-type: master
  name: {machine_name}
  namespace: openshift-machine-api
spec:
  authoritativeAPI: MachineAPI
  metadata: {}
  providerSpec:
    value:
      apiVersion: baremetal.cluster.k8s.io/v1alpha1
      customDeploy:
        method: install_coreos
      hostSelector: {}
      image:
        checksum: ""
        url: ""
      kind: BareMetalMachineProviderSpec
      metadata:
        creationTimestamp: null
      userData:
        name: master-user-data-managed
----
+
* `metadata.annotations.metal3.io/BareMetalHost`: Replace `{bmh_name}` with the name of the BMH object that is associated with the host that you are replacing.
* `labels.machine.openshift.io/cluster-api-cluster`: Replace `{machine_hash_label}` with the label that you fetched from the machine that you deleted.
* `metadata.name`: Replace `{machine_name}` with the name of the machine that you deleted.
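+
After you update the sample configuration with your values, you can save it to a file and create the `Machine` object with `oc apply`; the file name in this example is only an illustration:
+
[source,terminal]
----
$ oc apply -f new-control-plane-machine.yaml
----
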
.. Create the new BMH object and the secret to store the BMC credentials by running the following command:
+
[source,terminal]
----
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: <secret_name>
  namespace: openshift-machine-api
data:
  password: <password>
  username: <username>
type: Opaque
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: {bmh_name}
  namespace: openshift-machine-api
spec:
  automatedCleaningMode: disabled
  bmc:
    address: <redfish_url>/{uuid}
    credentialsName: <name>
    disableCertificateVerification: true
  bootMACAddress: {boot_mac_address}
  bootMode: UEFI
  externallyProvisioned: false
  online: true
  rootDeviceHints:
    deviceName: /dev/disk/by-id/scsi-<serial_number>
  userData:
    name: master-user-data-managed
    namespace: openshift-machine-api
EOF
----
+
* `metadata.name` in the `Secret` object: Specify the name of the secret.
* `metadata.name` in the `BareMetalHost` object: Replace `{bmh_name}` with the name of the BMH object that you deleted.
* `bmc.address`: Replace `{uuid}` with the UUID of the node that you created.
* `bmc.credentialsName`: Replace `<name>` with the name of the secret that you created.
* `bootMACAddress`: Replace `{boot_mac_address}` with the MAC address of the provisioning network interface. This is the MAC address that the node uses to identify itself when communicating with Ironic during provisioning.
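+
The values under `data` in a `Secret` object must be base64 encoded. For example, you can encode the BMC user name and password before adding them to the secret; the values shown here are placeholders:
+
[source,terminal]
----
$ echo -n '<username>' | base64
$ echo -n '<password>' | base64
----
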
. Verify that the new node has reached the `Provisioned` state by running the following command:
+
[source,terminal]
----
$ oc get bmh -n openshift-machine-api -o wide
----
+
The value of the `STATUS` column in the output of this command must be `Provisioned`.
+
[NOTE]
====
The provisioning process can take 10 to 20 minutes to complete.
====
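+
Because provisioning can take some time, you can optionally watch the BMH objects until the state changes:
+
[source,terminal]
----
$ oc get bmh -n openshift-machine-api -w
----
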
. Verify that both control plane nodes are in the `Ready` state by running the following command:
+
[source,terminal]
----
$ oc get nodes
----
+
The value of the `STATUS` column in the output of this command must be `Ready` for both nodes.
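+
Optionally, you can wait for the replacement node to report the `Ready` condition instead of polling manually; the timeout value is only an example:
+
[source,terminal]
----
$ oc wait --for=condition=Ready node/<node_name> --timeout=30m
----
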
. Apply the `detached` annotation to the BMH object to prevent the Machine API from managing it by running the following command:
+
[source,terminal]
----
$ oc annotate bmh <bmh_name> -n openshift-machine-api baremetalhost.metal3.io/detached='' --overwrite
----
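+
To confirm that the annotation was applied, you can inspect the annotations on the BMH object:
+
[source,terminal]
----
$ oc get bmh <bmh_name> -n openshift-machine-api -o jsonpath='{.metadata.annotations}'
----
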
. Rejoin the replacement node to the Pacemaker cluster by running the following commands:
+
[NOTE]
====
Run the following commands on the survivor control plane node, not on the node that is being replaced.
====
+
[source,terminal]
----
$ sudo pcs cluster node remove <node_name>
----
+
[source,terminal]
----
$ sudo pcs cluster node add <node_name> addr=<node_ip> --start --enable
----
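+
After the node is added, you can review the cluster membership and resource state on the survivor node, for example:
+
[source,terminal]
----
$ sudo pcs status
----
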
. Delete stale jobs for the failed node by running the following commands:
+
[source,terminal]
----
$ oc project openshift-etcd
----
+
[source,terminal]
----
$ oc delete job tnf-auth-job-<node_name>
----
+
[source,terminal]
----
$ oc delete job tnf-after-setup-job-<node_name>
----
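+
If you are not sure which jobs reference the failed node, you can list the jobs in the namespace first:
+
[source,terminal]
----
$ oc get jobs -n openshift-etcd
----
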
.Verification

For information about verifying that both control plane nodes and etcd are operating correctly, see "Verifying etcd health in a two-node OpenShift cluster with fencing".