From 35600c4ceb091e26ededf6dc67a930a9d5174030 Mon Sep 17 00:00:00 2001
From: Laura Hinson
Date: Fri, 26 Sep 2025 11:42:09 -0400
Subject: [PATCH] [OSDOCS-13615]: etcd performance troubleshooting section

---
 _topic_maps/_topic_map.yml | 2 +-
 etcd/etcd-performance.adoc | 42 ++++-
 modules/etcd-consensus-latency.adoc | 71 ++++++++
 modules/etcd-database-size.adoc | 60 ++++++
 ...d-determine-kube-api-transaction-rate.adoc | 30 +++
 modules/etcd-disk-latency.adoc | 29 +++
 .../etcd-leader-election-log-replication.adoc | 19 ++
 modules/etcd-network-latency-jitter.adoc | 171 ++++++++++++++++++
 modules/etcd-node-scaling.adoc | 5 +
 modules/etcd-peer-round-trip.adoc | 53 ++++++
 modules/etcd-timer-tunables.adoc | 25 +++
 11 files changed, 504 insertions(+), 3 deletions(-)
 create mode 100644 modules/etcd-consensus-latency.adoc
 create mode 100644 modules/etcd-database-size.adoc
 create mode 100644 modules/etcd-determine-kube-api-transaction-rate.adoc
 create mode 100644 modules/etcd-disk-latency.adoc
 create mode 100644 modules/etcd-leader-election-log-replication.adoc
 create mode 100644 modules/etcd-network-latency-jitter.adoc
 create mode 100644 modules/etcd-peer-round-trip.adoc
 create mode 100644 modules/etcd-timer-tunables.adoc

diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
index 6468c67fcd..1910890179 100644
--- a/_topic_maps/_topic_map.yml
+++ b/_topic_maps/_topic_map.yml
@@ -2482,7 +2482,7 @@ Topics:
   File: etcd-overview
 - Name: Recommended etcd practices
   File: etcd-practices
-- Name: Performance considerations for etcd
+- Name: Ensuring reliable etcd performance and scalability
   File: etcd-performance
 - Name: Backing up and restoring etcd data
   Dir: etcd-backup-restore
diff --git a/etcd/etcd-performance.adoc b/etcd/etcd-performance.adoc
index 48460d8055..cc40c2b4db 100644
--- a/etcd/etcd-performance.adoc
+++ b/etcd/etcd-performance.adoc
@@ -1,13 +1,22 @@
 :_mod-docs-content-type: ASSEMBLY
 [id="etcd-performance"]
 include::_attributes/common-attributes.adoc[]
-= Performance considerations for etcd
+= Ensuring reliable etcd performance and scalability
 :context: etcd-performance
 
 toc::[]
 
-To ensure optimal performance and scalability for etcd in {product-title}, you can complete the following practices.
+To ensure optimal performance with etcd, it is important to understand the conditions that affect performance, including node scaling, leader election, log replication, tuning, latency, network jitter, peer round trip time, database size, and Kubernetes API transaction rates.
+// Leader election and log replication
+include::modules/etcd-leader-election-log-replication.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+.Additional resources
+* link:https://etcd.io/docs/v3.5/learning/design-learner/[The etcd learner design]
+* link:https://etcd.io/docs/v3.5/op-guide/failures/[Failure modes]
+
+//Node scaling for etcd
 include::modules/etcd-node-scaling.adoc[leveloffset=+1]
 
 [role="_additional-resources"]
 .Additional resources
@@ -17,18 +26,47 @@ include::modules/etcd-node-scaling.adoc[leveloffset=+1]
 * link:https://docs.redhat.com/en/documentation/assisted_installer_for_openshift_container_platform/2024/html/installing_openshift_container_platform_with_the_assisted_installer/expanding-the-cluster#installing-control-plane-node-healthy-cluster_expanding-the-cluster[Expanding the cluster]
 * xref:../backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.adoc#dr-restoring-cluster-state[Restoring to a previous cluster state]
 
+// Effects of disk latency on etcd
+include::modules/etcd-disk-latency.adoc[leveloffset=+1]
+
+// Monitoring consensus latency for etcd
+include::modules/etcd-consensus-latency.adoc[leveloffset=+1]
+
+//Moving etcd to a different disk
 include::modules/move-etcd-different-disk.adoc[leveloffset=+1]
 
 [role="_additional-resources"]
 .Additional resources
 * xref:../architecture/architecture-rhcos.adoc#architecture-rhcos[Red Hat Enterprise Linux CoreOS (RHCOS)]
 
+//Defragmenting etcd data
 include::modules/etcd-defrag.adoc[leveloffset=+1]
 
+//Setting tuning parameters for etcd
 include::modules/etcd-tuning-parameters.adoc[leveloffset=+1]
 
 [role="_additional-resources"]
 .Additional resources
 * xref:../nodes/clusters/nodes-cluster-enabling-features.adoc#nodes-cluster-enabling-features-about_nodes-cluster-enabling[Understanding feature gates]
 
+// OCP timer tunables for etcd
+include::modules/etcd-timer-tunables.adoc[leveloffset=+1]
+
+// Determining the size of the etcd database and understanding its effects
+include::modules/etcd-database-size.adoc[leveloffset=+1]
+
+//Increasing the database size for etcd
 include::modules/etcd-increase-db.adoc[leveloffset=+1]
+
+// Measuring network jitter between control plane nodes
+include::modules/etcd-network-latency-jitter.adoc[leveloffset=+1]
+
+// How etcd peer round trip time affects performance
+include::modules/etcd-peer-round-trip.adoc[leveloffset=+1]
+
+// Determining Kubernetes API transaction rate for your environment
+include::modules/etcd-determine-kube-api-transaction-rate.adoc[leveloffset=+1]
+
+[role="_additional-resources"]
+.Additional resources
+* link:https://kube-burner.github.io/kube-burner-ocp/latest/[kube-burner-ocp documentation]
\ No newline at end of file
diff --git a/modules/etcd-consensus-latency.adoc b/modules/etcd-consensus-latency.adoc
new file mode 100644
index 0000000000..66f306c122
--- /dev/null
+++ b/modules/etcd-consensus-latency.adoc
@@ -0,0 +1,71 @@
+// Module included in the following assemblies:
+//
+// * etcd/etcd-performance.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="etcd-consensus-latency_{context}"]
+= Monitoring consensus latency for etcd
+
+By using the `etcdctl` CLI, you can monitor the latency for reaching consensus as experienced by etcd. You must identify one of the etcd pods and then retrieve the endpoint health.
+
+This procedure, which validates and monitors cluster health, can be run only on an active cluster.
+
+.Prerequisites
+
+* You completed the disk and network tests during planning for cluster deployment.
+
+.Procedure
+
+. List the etcd pods by entering the following command:
++
+[source,terminal]
+----
+# oc get pods -n openshift-etcd -l app=etcd
+----
++
+.Example output
+[source,terminal]
+----
+NAME      READY   STATUS    RESTARTS   AGE
+etcd-m0   4/4     Running   4          8h
+etcd-m1   4/4     Running   4          8h
+etcd-m2   4/4     Running   4          8h
+----
+
+. Retrieve the endpoint health by entering the following command. To better understand the etcd latency for consensus, run this command on a precise watch cycle for a few minutes and observe that the numbers remain below the approximately 66 ms threshold. The closer the consensus time is to 100 ms, the more likely the cluster is to experience service-affecting events and instability.
++
+[source,terminal]
+----
+# oc exec -ti etcd-m0 -- etcdctl endpoint health -w table
+----
++
+.Example output
+[source,terminal]
+----
++----------------------------+--------+-------------+-------+
+|          ENDPOINT          | HEALTH |    TOOK     | ERROR |
++----------------------------+--------+-------------+-------+
+| https://198.18.111.12:2379 |   true |  3.798349ms |       |
+| https://198.18.111.14:2379 |   true |  7.389608ms |       |
+| https://198.18.111.13:2379 |   true |  6.263117ms |       |
++----------------------------+--------+-------------+-------+
+----
+
+. Watch the endpoint health on a repeating cycle by entering the following command:
++
+[source,terminal]
+----
+# oc exec -ti etcd-m0 -- watch -dp -c etcdctl endpoint health -w table
+----
++
+.Example output
+[source,terminal]
+----
++----------------------------+--------+-------------+-------+
+|          ENDPOINT          | HEALTH |    TOOK     | ERROR |
++----------------------------+--------+-------------+-------+
+| https://198.18.111.12:2379 |   true |  9.533405ms |       |
+| https://198.18.111.13:2379 |   true |  4.628054ms |       |
+| https://198.18.111.14:2379 |   true |  5.803378ms |       |
++----------------------------+--------+-------------+-------+
+----
\ No newline at end of file
diff --git a/modules/etcd-database-size.adoc b/modules/etcd-database-size.adoc
new file mode 100644
index 0000000000..4925a5def3
--- /dev/null
+++ b/modules/etcd-database-size.adoc
@@ -0,0 +1,60 @@
+// Module included in the following assemblies:
+//
+// * etcd/etcd-performance.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="etcd-database-size_{context}"]
+= Determining the size of the etcd database and understanding its effects
+
+The size of the etcd database has a direct impact on the time to complete the etcd defragmentation process. {product-title} automatically runs etcd defragmentation on one etcd member at a time when it detects at least 45% fragmentation. During the defragmentation process, the etcd member cannot process any requests. On small etcd databases, the defragmentation process completes in less than a second. With larger etcd databases, the disk latency directly impacts the defragmentation time, causing additional latency, because operations are blocked while defragmentation happens.
+
+The size of the etcd database is a factor to consider when network partitions isolate a control plane node for a period of time and the control plane needs to resync after communication is re-established.
+
+Few options exist for controlling the size of the etcd database, because it depends on the operators and applications in the system. When you consider the latency range under which the system will operate, account for the effects of synchronization and defragmentation per size of the etcd database.
+
+The magnitude of the effects is specific to the deployment. The time to complete a defragmentation causes degradation in the transaction rate, because the etcd member cannot accept updates during the defragmentation process.
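+
+To see how close each member is to the 45% fragmentation threshold at which {product-title} triggers defragmentation, you can compare the allocated and in-use database sizes that etcd reports. The following query is a minimal sketch against the standard etcd metrics `etcd_mvcc_db_total_size_in_bytes` and `etcd_mvcc_db_total_size_in_use_in_bytes`; you can run it in the {product-title} console under *Observe* -> *Metrics*:
+
+[source,terminal]
+----
+# PromQL: fraction of each member's database file that is fragmented (0.45 = 45%)
+(etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes) / etcd_mvcc_db_total_size_in_bytes
+----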
+
+The time for the etcd re-synchronization of large databases with a high change rate similarly affects the transaction rate and transaction latency on the system.
+
+Consider the following two examples of the types of impact to plan for.
+
+Example of the effect of etcd defragmentation based on database size:: Writing an etcd database of 1 GB to a slow 7200 RPM disk at 80 Mbit/s takes about 1 minute and 40 seconds, because 1 GB is 8,000 Mbit and 8,000 Mbit divided by 80 Mbit/s is 100 seconds. In such a scenario, the defragmentation process takes at least this long, if not longer, to complete.
+
+Example of the effect of database size on etcd synchronization:: If 10% of the etcd database changes during the disconnection of one of the control plane nodes, the resync needs to transfer at least 100 MB. Transferring 100 MB over a 1 Gbps link takes 800 ms. On clusters with regular transactions against the Kubernetes API, the larger the etcd database size, the more likely network instability is to cause control plane instability.
+
+You can determine the size of an etcd database by using the {product-title} console or by running commands in the `etcdctl` tool.
+
+.Procedure
+
+* To find the database size in the {product-title} console, go to the *etcd* dashboard to view a plot that reports the size of the etcd database.
+
+* To find the database size by using the `etcdctl` tool, enter two commands:
+
+.. List the pods by entering the following command:
++
+[source,terminal]
+----
+# oc get pods -n openshift-etcd -l app=etcd
+----
++
+.Example output
+[source,terminal]
+----
+NAME      READY   STATUS    RESTARTS   AGE
+etcd-m0   4/4     Running   4          22h
+etcd-m1   4/4     Running   4          22h
+etcd-m2   4/4     Running   4          22h
+----
+
+.. View the database size in the output of the following command:
++
+[source,terminal]
+----
+# oc exec -t etcd-m0 -- etcdctl endpoint status -w simple | cut -d, -f 1,3,4
+----
++
+.Example output
+[source,terminal]
+----
+https://198.18.111.12:2379, 3.5.6, 1.1 GB
+https://198.18.111.13:2379, 3.5.6, 1.1 GB
+https://198.18.111.14:2379, 3.5.6, 1.1 GB
+----
diff --git a/modules/etcd-determine-kube-api-transaction-rate.adoc b/modules/etcd-determine-kube-api-transaction-rate.adoc
new file mode 100644
index 0000000000..09a5b9afcd
--- /dev/null
+++ b/modules/etcd-determine-kube-api-transaction-rate.adoc
@@ -0,0 +1,30 @@
+// Module included in the following assemblies:
+//
+// * etcd/etcd-performance.adoc
+
+:_mod-docs-content-type: CONCEPT
+[id="etcd-determine-kube-api-transaction-rate_{context}"]
+= Determining Kubernetes API transaction rate for your environment
+
+When you are using stretched control planes, the Kubernetes API transaction rate depends on the characteristics of the particular deployment. Specifically, it depends on the following combined factors:
+
+* The etcd disk latency
+* The etcd round trip time
+* The size of objects that are being written to the API
+
+As a result, when you use stretched control planes, cluster administrators must test the environment to determine the sustained transaction rate that it can support. The `kube-burner` tool is useful for that purpose. The binary includes a wrapper for testing OpenShift clusters: `kube-burner-ocp`. You can use `kube-burner-ocp` to test cluster or node density. To test the control plane, `kube-burner-ocp` has three workload profiles: `cluster-density`, `cluster-density-v2`, and `cluster-density-ms`. Each workload profile creates a series of resources that are designed to load the control plane.
For more information about each profile, see the `kube-burner-ocp` workload documentation.
+
+.Procedure
+
+. Create and delete resources by running a workload profile. The following example command creates and deletes resources within 20 minutes:
++
+[source,terminal]
+----
+# kube-burner ocp cluster-density-ms --churn-duration 20m --churn-delay 0s --iterations 10 --timeout 30m
+----
+
+. During the run, observe the API performance dashboard in the {product-title} console by clicking *Observe* -> *Dashboards*, and from the *Dashboards* menu, clicking *API Performance*. This dashboard shows all the relevant API performance information.
++
+On the dashboard, notice how the control plane responds during load and the 99th percentile transaction rate that it can achieve for the execution of various verbs and request rates by read and write. Use this information and the knowledge of your organization's workload to determine the load that the organization can put on the clusters for the specific stretched control plane deployment.
\ No newline at end of file
diff --git a/modules/etcd-disk-latency.adoc b/modules/etcd-disk-latency.adoc
new file mode 100644
index 0000000000..5c9eec4d61
--- /dev/null
+++ b/modules/etcd-disk-latency.adoc
@@ -0,0 +1,29 @@
+// Module included in the following assemblies:
+//
+// * etcd/etcd-performance.adoc
+
+:_mod-docs-content-type: CONCEPT
+[id="etcd-disk-latency_{context}"]
+= Effects of disk latency on etcd
+
+An etcd cluster is sensitive to disk latencies. To understand the disk latency that etcd experiences in your control plane environment, run the `fio` test suite.
+
+Make sure that the final report classifies the disk as appropriate for etcd, as shown in the following example:
+
+[source,terminal]
+----
+...
+99th percentile of fsync is 5865472 ns
+99th percentile of the fsync is within the recommended threshold: - 20 ms, the disk can be used to host etcd
+----
+
+When a high-latency disk is used, a message states that the disk is not recommended for etcd, as shown in the following example:
+
+[source,terminal]
+----
+...
+99th percentile of fsync is 15865472 ns
+99th percentile of the fsync is greater than the recommended value which is 20 ms, faster disks are recommended to host etcd for better performance
+----
+
+In cluster deployments that span multiple data centers, using disks for etcd that do not meet the recommended latency increases the chance of service-affecting failures and dramatically reduces the network latency that the control plane can sustain.
\ No newline at end of file
diff --git a/modules/etcd-leader-election-log-replication.adoc b/modules/etcd-leader-election-log-replication.adoc
new file mode 100644
index 0000000000..2ecc505f57
--- /dev/null
+++ b/modules/etcd-leader-election-log-replication.adoc
@@ -0,0 +1,19 @@
+// Module included in the following assemblies:
+//
+// * etcd/etcd-performance.adoc
+
+:_mod-docs-content-type: CONCEPT
+[id="etcd-leader-election-log-replication_{context}"]
+= Leader election and log replication of etcd
+
+etcd is a consistent, distributed key-value store that operates as a cluster of replicated nodes. Following the Raft algorithm, etcd elects one node as the leader and the others as followers. The leader maintains the system's current state and ensures that the followers are up to date.
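+
+For example, you can check which member currently holds the leader role by querying the endpoint status of the cluster. The following command is a sketch that assumes an etcd pod named `etcd-m0`, matching the examples elsewhere in this section; the `IS LEADER` column in the output identifies the leader:
+
+[source,terminal]
+----
+# oc exec -ti etcd-m0 -- etcdctl endpoint status --cluster -w table
+----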
+
+The leader node is responsible for log replication. It handles incoming write transactions from the client and writes a Raft log entry that it then broadcasts to the followers.
+
+//diagram goes here
+
+When an etcd client, such as `kube-apiserver`, connects to an etcd member and requests an action that requires quorum, such as writing a value, and the etcd member is a follower, the follower returns a message indicating that the transaction must be sent to the leader.
+
+//second diagram goes here
+
+When the etcd client requests an action that requires quorum from the leader, the leader keeps the client connection open while it writes the local Raft log, broadcasts the log to the followers, and waits for the majority of the followers to acknowledge that they committed the log without failures. Only then does the leader send the acknowledgment to the etcd client and close the session. If the followers send failure notifications and the majority fails to reach consensus, the leader returns the error message to the client and closes the session.
\ No newline at end of file
diff --git a/modules/etcd-network-latency-jitter.adoc b/modules/etcd-network-latency-jitter.adoc
new file mode 100644
index 0000000000..8302d788f6
--- /dev/null
+++ b/modules/etcd-network-latency-jitter.adoc
@@ -0,0 +1,171 @@
+// Module included in the following assemblies:
+//
+// * etcd/etcd-performance.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="etcd-network-latency-jitter_{context}"]
+= Measuring network jitter between control plane nodes
+
+//lahinson: The following line is in the original KCS article, but no "MTU discovery and validation section" is found, so I commented-out this line.
+
+//Use the tools that are described in the MTU discovery and validation section to obtain the average and maximum network latency.
+
+The value of the heartbeat interval should be around the maximum of the average round-trip time (RTT) between members, normally around 1.5 times the RTT. With the {product-title} default heartbeat interval of 100 ms, the recommended RTT between control plane nodes is less than approximately 33 ms, with a maximum of less than 66 ms (66 ms multiplied by 1.5 equals 99 ms). For more information, see "Setting tuning parameters for etcd". Any higher network latency might cause service-affecting events and cluster instability.
+
+Network latency is influenced by many factors, including but not limited to the following:
+
+* The technology of the transport networks, such as copper, fiber, wireless, or satellite
+* The number and quality of the network devices in the transport network
+
+A good evaluation reference is to compare the network latency in your organization with the commercial latencies that telecommunications providers publish, such as monthly IP latency statistics.
+
+Consider network latency together with network jitter for more accurate calculations. _Network jitter_ is the variance in network latency or, more specifically, the variation in the delay of received packets. In ideal network conditions, the jitter is as close to zero as possible. Network jitter affects the network latency calculations for etcd because the actual network latency over time is the RTT plus or minus the jitter. For example, a network with a maximum latency of 80 ms and jitter of 30 ms experiences latencies of up to 110 ms, which means etcd misses heartbeats, causing request timeouts and temporary leader loss.
During a leader loss and reelection, the Kubernetes API cannot process any requests, which causes a service-affecting event and cluster instability.
+
+It is important to measure the network jitter among all control plane nodes. To do so, you can use the `iPerf3` tool in UDP mode.
+
+.Prerequisites
+
+* You built your own iPerf image. For more information, see the following Red{nbsp}Hat Knowledgebase articles:
+
+** link:https://access.redhat.com/articles/5233541[Testing Network Bandwidth in OpenShift using iPerf Container]
+** link:https://access.redhat.com/solutions/6129701[How to run iPerf network performance test in OpenShift 4]
+
+.Procedure
+
+. Connect to one of the control plane nodes and run the iPerf container as an iPerf server in host network mode. When the tool runs in server mode, it accepts TCP and UDP tests. Enter the following command, replacing `iperf3` with the name of your iPerf image if it differs:
++
+[source,terminal]
+----
+# podman run -ti --rm --net host iperf3 -s
+----
+
+. Connect to another control plane node and run iPerf in UDP client mode by entering the following command, replacing `<server_node>` with the host name or IP address of the node that runs the iPerf server:
++
+[source,terminal]
+----
+# podman run -ti --rm --net host iperf3 -u -c <server_node> -t 300
+----
++
+By default, the test runs for 10 seconds; the `-t 300` option in this example extends the run to 300 seconds. At the end, the client output shows the average jitter from the client perspective.
+
+. Start a debug session on the node by entering the following command:
++
+[source,terminal]
+----
+# oc debug node/m1
+----
++
+.Example output
+[source,terminal]
+----
+Starting pod/m1-debug ...
+To use host binaries, run `chroot /host`
+Pod IP: 198.18.111.13
+If you don't see a command prompt, try pressing enter.
+----
+
+. Enter the following commands:
++
+[source,terminal]
+----
+sh-4.4# chroot /host
+----
++
+[source,terminal]
+----
+sh-4.4# podman run -ti --rm --net host iperf3 -u -c m0
+----
++
+.Example output
+[source,terminal]
+----
+Connecting to host m0, port 5201
+[  5] local 198.18.111.13 port 60878 connected to 198.18.111.12 port 5201
+[ ID] Interval           Transfer     Bitrate         Total Datagrams
+[  5]   0.00-1.00   sec   129 KBytes  1.05 Mbits/sec  91
+[  5]   1.00-2.00   sec   127 KBytes  1.04 Mbits/sec  90
+[  5]   2.00-3.00   sec   129 KBytes  1.05 Mbits/sec  91
+[  5]   3.00-4.00   sec   129 KBytes  1.05 Mbits/sec  91
+[  5]   4.00-5.00   sec   127 KBytes  1.04 Mbits/sec  90
+[  5]   5.00-6.00   sec   129 KBytes  1.05 Mbits/sec  91
+[  5]   6.00-7.00   sec   127 KBytes  1.04 Mbits/sec  90
+[  5]   7.00-8.00   sec   129 KBytes  1.05 Mbits/sec  91
+[  5]   8.00-9.00   sec   127 KBytes  1.04 Mbits/sec  90
+[  5]   9.00-10.00  sec   129 KBytes  1.05 Mbits/sec  91
+- - - - - - - - - - - - - - - - - - - - - - - - -
+[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
+[  5]   0.00-10.00  sec  1.25 MBytes  1.05 Mbits/sec  0.000 ms  0/906 (0%)  sender
+[  5]   0.00-10.04  sec  1.25 MBytes  1.05 Mbits/sec  1.074 ms  0/906 (0%)  receiver
+
+iperf Done.
+----
+
+. On the iPerf server, the output shows the jitter at each one-second interval, and the average is shown at the end. For the purpose of this test, identify the maximum jitter that is experienced during the test, ignoring the output of the first second because it might contain an invalid measurement. Enter the following command:
++
+[source,terminal]
+----
+# oc debug node/m0
+----
++
+.Example output
+[source,terminal]
+----
+Starting pod/m0-debug ...
+To use host binaries, run `chroot /host`
+Pod IP: 198.18.111.12
+If you don't see a command prompt, try pressing enter.
+----
+. Enter the following commands:
++
+[source,terminal]
+----
+sh-4.4# chroot /host
+----
++
+[source,terminal]
+----
+sh-4.4# podman run -ti --rm --net host iperf3 -s
+----
++
+.Example output
+[source,terminal]
+----
+-----------------------------------------------------------
+Server listening on 5201
+-----------------------------------------------------------
+Accepted connection from 198.18.111.13, port 44136
+[  5] local 198.18.111.12 port 5201 connected to 198.18.111.13 port 60878
+[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
+[  5]   0.00-1.00   sec   124 KBytes  1.02 Mbits/sec  4.763 ms  0/88 (0%)
+[  5]   1.00-2.00   sec   127 KBytes  1.04 Mbits/sec  4.735 ms  0/90 (0%)
+[  5]   2.00-3.00   sec   129 KBytes  1.05 Mbits/sec  0.568 ms  0/91 (0%)
+[  5]   3.00-4.00   sec   127 KBytes  1.04 Mbits/sec  2.443 ms  0/90 (0%)
+[  5]   4.00-5.00   sec   129 KBytes  1.05 Mbits/sec  1.372 ms  0/91 (0%)
+[  5]   5.00-6.00   sec   127 KBytes  1.04 Mbits/sec  2.769 ms  0/90 (0%)
+[  5]   6.00-7.00   sec   129 KBytes  1.05 Mbits/sec  2.393 ms  0/91 (0%)
+[  5]   7.00-8.00   sec   127 KBytes  1.04 Mbits/sec  0.883 ms  0/90 (0%)
+[  5]   8.00-9.00   sec   129 KBytes  1.05 Mbits/sec  0.594 ms  0/91 (0%)
+[  5]   9.00-10.00  sec   127 KBytes  1.04 Mbits/sec  0.953 ms  0/90 (0%)
+[  5]  10.00-10.04  sec  5.66 KBytes  1.30 Mbits/sec  1.074 ms  0/4 (0%)
+- - - - - - - - - - - - - - - - - - - - - - - - -
+[ ID] Interval           Transfer     Bitrate         Jitter    Lost/Total Datagrams
+[  5]   0.00-10.04  sec  1.25 MBytes  1.05 Mbits/sec  1.074 ms  0/906 (0%)  receiver
+-----------------------------------------------------------
+Server listening on 5201
+-----------------------------------------------------------
+----
+
+. Add the calculated jitter as a penalty to the network latency. For example, if the network latency is 80 ms and the jitter is 30 ms, consider an effective network latency of 110 ms for the purposes of the control plane. In this example, that value goes above the 100 ms threshold, and the system will miss heartbeats.
+
+. When you calculate the network latency for etcd, use the effective network latency, which is the sum of the RTT and the jitter:
++
+RTT + jitter
++
+You might be able to use the average jitter value to calculate the penalty, but the cluster can sporadically miss heartbeats if the etcd heartbeat timer is lower than the sum of the RTT and the maximum jitter:
++
+RTT + max(jitter)
++
+Instead, consider using the 99th percentile or the maximum jitter value for a more resilient deployment:
++
+Effective network latency = RTT + max(jitter)
\ No newline at end of file
diff --git a/modules/etcd-node-scaling.adoc b/modules/etcd-node-scaling.adoc
index 6ae21dee26..4e8362af41 100644
--- a/modules/etcd-node-scaling.adoc
+++ b/modules/etcd-node-scaling.adoc
@@ -14,6 +14,11 @@ Scaling a cluster to 4 or 5 control plane nodes is available only on bare metal
 
 For more information about how to scale control plane nodes by using the Assisted Installer, see "Adding hosts with the API" and "Replacing a control plane node in a healthy cluster".
 
+[NOTE]
+====
+While adding control plane nodes can increase reliability and availability, it can decrease throughput and increase latency, affecting performance.
+==== + The following table shows failure tolerance for clusters of different sizes: .Failure tolerances by cluster size diff --git a/modules/etcd-peer-round-trip.adoc b/modules/etcd-peer-round-trip.adoc new file mode 100644 index 0000000000..dd7759c4ce --- /dev/null +++ b/modules/etcd-peer-round-trip.adoc @@ -0,0 +1,53 @@ +// Module included in the following assemblies: +// +// * etcd/etcd-performance.adoc + +:_mod-docs-content-type: CONCEPT +[id="etcd-peer-round-trip_{context}"] += How etcd peer round trip time affects performance + +The etcd peer round trip time is an end-to-end test metric on how quickly something can be replicated among members. It shows the latency of etcd to finish replicating a client request among all the etcd members. The etcd peer round trip time is not the same thing as the network round trip time. + +You can monitor various etcd metrics on dashboards in the {product-title} console. In the console, click *Observe* -> *Dashboards* and from the dropdown list, select *etcd*. + +Near the end of the *etcd* dashboard, you can find a plot that summarizes the etcd peer round trip time. + +[NOTE] +==== +These etcd metrics are collected by the OpenShift metrics system in Prometheus. You can access them from the CLI by following the Red{nbsp}Hat Knowledgebase solution, link:https://access.redhat.com/solutions/5151831[How to query from the command line Prometheus statistics]. +==== + +[source,terminal] +---- +# Get token to connect to Prometheus +SECRET=$(oc get secret -n openshift-user-workload-monitoring | grep prometheus-user-workload-token | head -n 1 | awk '{print $1 }') +export TOKEN=$(oc get secret $SECRET -n openshift-user-workload-monitoring -o json | jq -r '.data.token' | base64 -d) +export THANOS_QUERIER_HOST=$(oc get route thanos-querier -n openshift-monitoring -o json | jq -r '.spec.host') +---- + +Queries must be URL-encoded. The following example shows how to retrieve the metrics that are reporting the round trip time (in seconds) for etcd to finish replicating the client requests among the members: + +[source,terminal] +---- +# prometheus query +query="histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))" + +# urlencoded query +encoded_query=$(printf "%s" $query |jq -sRr @uri) + +# querying the OpenShift metrics service +curl -s -X GET -k -H "Authorization: Bearer $TOKEN" "https://$THANOS_QUERIER_HOST/api/v1/query?query=$encoded_query" | jq '.data.result[] | .metric.pod,.value[1]' + +"etcd-m2" +"0.09318400000000004" # example ~93ms +"etcd-m0" +"0.050688" # example ~51ms +"etcd-m1" +"0.050688" # example ~51ms +---- + +The following metrics are also relevant to understanding etcd performance: + +etcd_disk_wal_fsync_duration_seconds_bucket:: Reports the etcd WAL fsync duration. +etcd_disk_backend_commit_duration_seconds_bucket:: Reports the etcd backend commit latency duration. +etcd_server_leader_changes_seen_total:: Reports the leader changes. \ No newline at end of file diff --git a/modules/etcd-timer-tunables.adoc b/modules/etcd-timer-tunables.adoc new file mode 100644 index 0000000000..ae404158f6 --- /dev/null +++ b/modules/etcd-timer-tunables.adoc @@ -0,0 +1,25 @@ +// Module included in the following assemblies: +// +// * etcd/etcd-performance.adoc + +:_mod-docs-content-type: CONCEPT +[id="etcd-timer-tunables_{context}"] += {product-title} timer tunables for etcd + +{product-title} maintains etcd timers that are optimized for each platform. 
+The default etcd timers with `platform=none` or `platform=metal` are as follows:
+
+[source,yaml]
+----
+- name: ETCD_ELECTION_TIMEOUT
+  value: "1000"
+  ...
+- name: ETCD_HEARTBEAT_INTERVAL
+  value: "100"
+----
+
+From an etcd perspective, the two key values are the election timeout and the heartbeat interval:
+
+Heartbeat interval:: The frequency with which the leader notifies followers that it is still the leader.
+Election timeout:: How long a follower node goes without hearing a heartbeat before it attempts to become the leader itself.
+
+These values do not tell the whole story for the control plane, or even for etcd. An etcd cluster is sensitive to disk latencies. Because etcd must persist proposals to its log, disk activity from other processes might cause long fsync latencies. The consequence is that etcd might miss heartbeats, causing request timeouts and temporary leader loss. During a leader loss and reelection, the Kubernetes API cannot process any requests, which causes a service-affecting event and cluster instability.
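+
+To confirm the timer values that are in effect on a running cluster, you can list the environment variables that are set on an etcd pod. The following check is a minimal sketch; it assumes an etcd pod named `etcd-m0`, matching the examples elsewhere in this section:
+
+[source,terminal]
+----
+# oc set env pod/etcd-m0 -n openshift-etcd -c etcd --list | grep -E 'ETCD_(HEARTBEAT_INTERVAL|ELECTION_TIMEOUT)'
+----
+
+On a `platform=none` or `platform=metal` cluster, the output shows the defaults from the previous list, `ETCD_HEARTBEAT_INTERVAL=100` and `ETCD_ELECTION_TIMEOUT=1000`.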