// Module included in the following assemblies:
//
// * etcd/etcd-guidance-span.adoc

:_mod-docs-content-type: REFERENCE
[id="deployment-caveats-span_{context}"]
= Deployment caveats for spanned clusters

The guidance in this documentation focuses on general aspects of a cluster deployment that spans data centers. Keep the following caveats in mind:

* Although deployments that span data centers are not bound by any special support requirements, these clusters have additional inherent complexities that can require extra consideration or support involvement, such as more time to identify, remediate, and resolve issues, compared with a standard single-site cluster.
* Applications might work poorly, or not at all, in clusters with high Kube API latency or low transaction rates. A minimal way to observe this latency is sketched after this list.
* Layered products, such as storage providers, might have stricter latency requirements than the platform itself. In those cases, the latency limits are dictated by the architectures that the layered product supports.
* Failure scenarios are amplified with stretched control planes, and how they manifest is specific to the deployment. Because of this, before using a deployment that spans data centers in a production environment, the organization should test and document the behavior of the cluster during disruptions such as the following (a fault-injection sketch follows this list):
** When there is a network partition leaving one, two, or all control plane nodes isolated
** When there are MTU mismatches on the transport network among the control plane nodes
** When there is a sustained spike in latency, as a Day 2 event, toward one or more of the control plane nodes
** When there is a considerable change in jitter due to network congestion, misconfiguration, lack of QoS, an intermediate network device causing packet errors, or other causes
* Clusters deployed across multiple sites, network infrastructures, storage infrastructures, or other components inherently have more points of failure. Network disruptions or splits are an especially serious threat to such clusters because they put the nodes at risk of losing contact with each other. Design multisite clusters with the potential for such failures in mind: extensively test failure scenarios, and consider whether the cluster is protected from every point of failure. Consult Red Hat Support for assistance with the important aspects of a resilient high availability cluster design.
* In some cases, geographic (GEO) awareness is a requirement for minimizing latency, so a proper global server load balancing (GSLB) method must be available.
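
One minimal, unsupported sketch of observing Kube API latency from a host with cluster access follows. It assumes the `oc` CLI is logged in to the cluster and uses `curl` to time HTTPS round trips to the API server health endpoint; even an unauthorized response still measures a full network round trip. Inside the cluster, the etcd-reported metric `etcd_network_peer_round_trip_time_seconds`, visible in the monitoring console, gives a more direct view of control plane peer latency; sustained values approaching the default etcd heartbeat interval of 100 ms are a warning sign for a spanned control plane.

[source,terminal]
----
# Resolve the API server URL for the current login context.
$ API_SERVER=$(oc whoami --show-server)

# Time ten HTTPS round trips to the API server health endpoint.
# Even a 401/403 response still measures the network round trip.
$ for i in $(seq 1 10); do
    curl -k -s -o /dev/null -w '%{time_total}s\n' "${API_SERVER}/readyz"
  done
----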
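
As a sketch of how the disruptions in the preceding list can be rehearsed in a non-production environment, the following commands use standard Linux tools (`tc netem`, `ping`, and `iptables`) run directly on a control plane node. The interface name `eth0` and the peer address `10.0.0.11` are placeholders invented for this example; substitute values from your own deployment, and remove each impairment when the test ends.

[source,terminal]
----
# Inject a sustained 100 ms latency spike with 20 ms of jitter on the
# transport interface (simulates the latency and jitter scenarios).
$ sudo tc qdisc add dev eth0 root netem delay 100ms 20ms

# Remove the impairment after the test.
$ sudo tc qdisc del dev eth0 root netem

# Check the path MTU toward a peer control plane node; with a 9000-byte
# MTU, the largest ICMP payload that passes unfragmented is 8972 bytes.
$ ping -M do -s 8972 -c 3 10.0.0.11

# Simulate a network partition by dropping all traffic to one peer,
# then restore connectivity.
$ sudo iptables -A OUTPUT -d 10.0.0.11 -j DROP
$ sudo iptables -D OUTPUT -d 10.0.0.11 -j DROP
----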