mirror of
https://github.com/openshift/openshift-docs.git
synced 2026-02-07 00:48:01 +01:00
This commit: - Adds recommendations around disk requirements for etcd and the critical metrics to monitor for etcd performance. - Adds instructions on running fio - a I/O benchmarking tool to help with determining whether the hardware is fast enough for hosting etcd.
57 lines
2.9 KiB
Plaintext
57 lines
2.9 KiB
Plaintext
// Module included in the following assemblies:
|
|
//
|
|
// * scalability_and_performance/recommended-host-practices.adoc
|
|
// * post_installation_configuration/cluster-tasks.adoc
|
|
// * post_installation_configuration/node-tasks.adoc
|
|
|
|
[id="recommended-etcd-practices_{context}"]
|
|
= Recommended etcd practices
|
|
|
|
For large and dense clusters, etcd can suffer from poor performance
|
|
if the keyspace grows excessively large and exceeds the space quota.
|
|
Periodic maintenance of etcd, including defragmentation, must be performed
|
|
to free up space in the data store. It is highly recommended that you monitor
|
|
Prometheus for etcd metrics and defragment it when required before etcd raises
|
|
a cluster-wide alarm that puts the cluster into a maintenance mode, which
|
|
only accepts key reads and deletes. Some of the key metrics to monitor are
|
|
`etcd_server_quota_backend_bytes` which is the current quota limit,
|
|
`etcd_mvcc_db_total_size_in_use_in_bytes` which indicates the actual
|
|
database usage after a history compaction, and
|
|
`etcd_debugging_mvcc_db_total_size_in_bytes` which shows the database size
|
|
including free space waiting for defragmentation.
|
|
|
|
Etcd writes data to disk, so its performance strongly depends on disk performance. Etcd
|
|
persists proposals on disk. Slow disks and disk activity from other processes might cause long
|
|
fsync latencies, causing etcd to miss heartbeats, inability to commit new proposals to the disk
|
|
on time, which can cause request timeouts and temporary leader loss. It is highly recommended to
|
|
run etcd on machines backed by SSD/NVMe disks with low latency and high throughput.
|
|
|
|
Some of the key metrics to monitor on a deployed {product-title} cluster
|
|
are p99 of etcd disk write ahead log duration and the number of etcd leader changes.
|
|
Use Prometheus to track these metrics. `etcd_disk_wal_fsync_duration_seconds_bucket`
|
|
reports the etcd disk fsync duration, `etcd_server_leader_changes_seen_total` reports
|
|
the leader changes. To rule out a slow disk and confirm that the disk is reasonably fast,
|
|
99th percentile of the `etcd_disk_wal_fsync_duration_seconds_bucket` should be less than 10ms.
|
|
|
|
Fio, a I/O benchmarking tool can be used to validate the hardware for etcd before or after
|
|
creating the OpenShift cluster. Run fio and analyze the results:
|
|
|
|
Assuming container runtimes like podman or docker are installed on the machine under test and
|
|
the path etcd writes the data exists - /var/lib/etcd, run:
|
|
|
|
.Procedure
|
|
Run the following if using podman:
|
|
[source,terminal]
|
|
----
|
|
$ sudo podman run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
|
|
----
|
|
|
|
Alternatively, run the following if using docker:
|
|
[source,terminal]
|
|
----
|
|
$ sudo docker run --volume /var/lib/etcd:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
|
|
----
|
|
|
|
The output reports whether the disk is fast enough to host etcd by comparing the 99th percentile
|
|
of the fsync metric captured from the run to see if it is less than 10ms.
|