Mirror of https://github.com/openshift/openshift-docs.git (synced 2026-02-05 12:46:18 +01:00)
TELCODOCS-2247-core updating modules from gitlab 419 core
Committed by: openshift-cherrypick-robot
Parent: ee190e00e6
Commit: f6a7902189
Binary file not shown.
Before size: 82 KiB | After size: 106 KiB
@@ -15,12 +15,10 @@ Setting these parameters manually is not supported. Incorrect parameter settings
|
||||
|
||||
All worker latency profiles configure the following parameters:
|
||||
|
||||
--
|
||||
node-status-update-frequency:: Specifies how often the kubelet posts node status to the API server.
|
||||
node-monitor-grace-period:: Specifies the amount of time in seconds that the Kubernetes Controller Manager waits for an update from a kubelet before marking the node unhealthy and adding the `node.kubernetes.io/not-ready` or `node.kubernetes.io/unreachable` taint to the node.
|
||||
default-not-ready-toleration-seconds:: Specifies the amount of time in seconds after marking a node unhealthy that the Kube API Server Operator waits before evicting pods from that node.
|
||||
default-unreachable-toleration-seconds:: Specifies the amount of time in seconds after marking a node unreachable that the Kube API Server Operator waits before evicting pods from that node.
|
||||
--
|
||||
|
||||
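For reference, the profile is selected through a single field on the cluster-scoped node configuration object. The following is a minimal sketch, assuming the `MediumUpdateAverageReaction` profile is the desired choice:

[source,yaml]
----
apiVersion: config.openshift.io/v1
kind: Node
metadata:
  name: cluster
spec:
  workerLatencyProfile: MediumUpdateAverageReaction # or Default, LowUpdateSlowReaction
----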
The following Operators monitor the changes to the worker latency profiles and respond accordingly:
|
||||
|
||||
|
||||
@@ -7,17 +7,19 @@
|
||||
= About the telco core cluster use model
|
||||
|
||||
The telco core cluster use model is designed for clusters that run on commodity hardware.
|
||||
Telco core clusters support large scale telco applications including control plane functions such as signaling, aggregation, and session border controller (SBC); and centralized data plane functions such as 5G user plane functions (UPF).
|
||||
Telco core clusters support large scale telco applications including control plane functions like signaling, aggregation, session border controller (SBC), and centralized data plane functions such as 5G user plane functions (UPF).
|
||||
Telco core cluster functions require scalability, complex networking support, and resilient software-defined storage, with performance requirements that are less stringent and constrained than those of far-edge RAN deployments.
|
||||
|
||||
.Telco core RDS cluster service-based architecture and networking topology
|
||||
image::openshift-5g-core-cluster-architecture-networking.png[5G core cluster showing a service-based architecture with overlaid networking topology]
|
||||
|
||||
Networking requirements for telco core functions vary widely across a range of networking features and performance points.
|
||||
IPv6 is a requirement and dual-stack is common.
|
||||
Some functions need maximum throughput and transaction rate and require support for user-plane DPDK networking.
|
||||
Other functions use more typical cloud-native patterns and can rely on OVN-Kubernetes, kernel networking, and load balancing.
|
||||
|
||||
Telco core clusters are configured as standard with three control plane and two or more worker nodes configured with the stock (non-RT) kernel.
|
||||
Telco core clusters are configured as standard with three control plane and one or more worker nodes configured with the stock (non-RT) kernel.
|
||||
In support of workloads with varying networking and performance requirements, you can segment worker nodes by using `MachineConfigPool` custom resources (CR), for example, for non-user data plane or high-throughput use cases.
|
||||
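A hedged sketch of such a `MachineConfigPool` CR for a high-throughput worker role follows; the pool name and node label are hypothetical:

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-dpdk # hypothetical pool for user data plane nodes
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [worker, worker-dpdk]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker-dpdk: ""
----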
In support of required telco operational features, core clusters have a standard set of Day 2 OLM-managed Operators installed.
|
||||
|
||||
|
||||
.Telco core RDS cluster service-based architecture and networking topology
|
||||
image::openshift-5g-core-cluster-architecture-networking.png[5G core cluster showing a service-based architecture with overlaid networking topology]
|
||||
|
||||
|
||||
@@ -9,3 +9,4 @@ You can use other storage solutions to provide persistent storage for telco core
|
||||
The configuration and integration of these solutions is outside the scope of the reference design specifications (RDS).
|
||||
|
||||
Integration of the storage solution into the telco core cluster must include proper sizing and performance analysis to ensure the storage meets overall performance and resource usage requirements.
|
||||
|
||||
|
||||
@@ -7,27 +7,27 @@
|
||||
= Agent-based Installer
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* No reference design updates in this release.
|
||||
|
||||
Description::
|
||||
+
|
||||
--
|
||||
Telco core clusters can be installed by using the Agent-based Installer.
|
||||
This method allows you to install OpenShift on bare-metal servers without requiring additional servers or VMs for managing the installation.
|
||||
Telco core clusters can be installed using the Agent-based Installer.
|
||||
This method allows you to install {product-title} on bare-metal servers without requiring additional servers or VMs for managing the installation.
|
||||
The Agent-based Installer can be run on any system (for example, from a laptop) to generate an ISO installation image.
|
||||
The ISO is used as the installation media for the cluster supervisor nodes.
|
||||
Installation progress can be monitored using the ABI tool from any system with network connectivity to the supervisor node's API interfaces.
|
||||
Progress can be monitored using the Agent-based Installer from any system with network connectivity to the supervisor node's API interfaces.
|
||||
|
||||
ABI supports the following:
|
||||
Agent-based Installer supports the following:
|
||||
|
||||
* Installation from declarative CRs
|
||||
* Installation in disconnected environments
|
||||
* Installation with no additional supporting install or bastion servers required to complete the installation
|
||||
* Installation from declarative CRs.
|
||||
* Installation in disconnected environments.
|
||||
* Installation without the use of additional servers to support installation, for example, the bastion node.
|
||||
--
|
||||
|
||||
Limits and requirements::
|
||||
* Disconnected installation requires a registry that is reachable from the installed host, with all required content mirrored in that registry.
|
||||
* Disconnected installation requires a registry with all required content mirrored and reachable from the installed host.
|
||||
|
||||
Engineering considerations::
|
||||
* Networking configuration should be applied as NMState configuration during installation.
|
||||
Day 2 networking configuration using the NMState Operator is not supported.
|
||||
* Networking configuration should be applied as NMState configuration during installation. Day 2 networking configuration using the NMState Operator is not supported.
|
||||
|
||||
|
||||
@@ -20,8 +20,7 @@ Engineering considerations::
|
||||
--
|
||||
Use the following information to plan telco core workloads and cluster resources:
|
||||
|
||||
include::snippets/nodes-cgroup-vi-removed.adoc[]
|
||||
|
||||
* As of {product-title} 4.19, cgroup v1 is no longer supported and has been removed. All workloads must now be compatible with cgroup v2. For more information, see link:https://www.redhat.com/en/blog/rhel-9-changes-context-red-hat-openshift-workloads[Red Hat Enterprise Linux 9 changes in the context of Red Hat OpenShift workloads].
|
||||
* CNF applications should conform to the latest version of https://redhat-best-practices-for-k8s.github.io/guide/[Red Hat Best Practices for Kubernetes].
|
||||
* Use a mix of best-effort and burstable QoS pods as required by your applications.
|
||||
** Use guaranteed QoS pods with proper configuration of reserved or isolated CPUs in the `PerformanceProfile` CR that configures the node.
|
||||
@@ -34,6 +33,6 @@ Use other probe implementations, for example, `httpGet` or `tcpSocket`.
|
||||
** When you need to use exec probes, limit the exec probe frequency and quantity.
|
||||
The maximum number of exec probes must be kept below 10, and the frequency must not be set to less than 10 seconds.
|
||||
** You can use startup probes, because they do not use significant resources at steady-state operation.
|
||||
This limitation on exec probes applies primarily to liveness and readiness probes.
|
||||
The limitation on exec probes applies primarily to liveness and readiness probes.
|
||||
Exec probes cause much higher CPU usage on management cores compared to other probe types because they require process forking.
|
||||
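For illustration, a liveness probe that avoids process forking can be expressed with `httpGet`; this sketch assumes the application exposes a `/healthz` endpoint on port 8080:

[source,yaml]
----
livenessProbe:
  httpGet:
    path: /healthz # hypothetical health endpoint
    port: 8080
  periodSeconds: 10 # keep probe frequency modest
  timeoutSeconds: 1
----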
--
|
||||
--
|
||||
@@ -8,16 +8,11 @@
|
||||
|
||||
* Cluster workloads are detailed in "Application workloads".
|
||||
* Worker nodes should run on either of the following CPUs:
|
||||
** Intel 3rd Generation Xeon (IceLake) CPUs or better when supported by {product-title}, or CPUs with the silicon security bug (Spectre and similar) mitigations turned off.
|
||||
Skylake and older CPUs can experience 40% transaction performance drops when Spectre and similar mitigations are enabled.
|
||||
** AMD EPYC Zen 4 CPUs (Genoa, Bergamo, or newer) or better when supported by {product-title}.
|
||||
+
|
||||
[NOTE]
|
||||
====
|
||||
Currently, per-pod power management is not available for AMD CPUs.
|
||||
====
|
||||
** Intel 3rd Generation Xeon (IceLake) CPUs or newer when supported by {product-title}, or CPUs with the silicon security bug (Spectre and similar) mitigations turned off.
|
||||
Skylake and older CPUs can experience 40% transaction performance drops when Spectre and similar mitigations are enabled. When Skylake and older CPUs change power states, this can cause latency.
|
||||
** AMD EPYC Zen 4 CPUs (Genoa, Bergamo).
|
||||
** IRQ balancing is enabled on worker nodes.
|
||||
The `PerformanceProfile` CR sets `globallyDisableIrqLoadBalancing` to false.
|
||||
The `PerformanceProfile` CR sets the `globallyDisableIrqLoadBalancing` parameter to a value of `false`.
|
||||
Guaranteed QoS pods are annotated to ensure isolation as described in "CPU partitioning and performance tuning".
|
||||
|
||||
* All cluster nodes should have the following features:
|
||||
@@ -37,7 +32,7 @@ See "CPU partitioning and performance tuning" for additional considerations.
|
||||
* CPU requirements for {product-title} depend on the configured feature set and application workload characteristics.
|
||||
For a cluster configured according to the reference configuration running a simulated workload of 3000 pods as created by the kube-burner node-density test, the following CPU requirements are validated:
|
||||
** The minimum number of reserved CPUs for control plane and worker nodes is 2 CPUs (4 hyper-threads) per NUMA node.
|
||||
** The NICs used for non-DPDK network traffic should be configured to use at least 16 RX/TX queues.
|
||||
** The NICs used for non-DPDK network traffic should be configured to use at most 32 RX/TX queues.
|
||||
** Nodes with large numbers of pods or other resources might require additional reserved CPUs.
|
||||
The remaining CPUs are available for user workloads.
|
||||
|
||||
@@ -46,3 +41,4 @@ The remaining CPUs are available for user workloads.
|
||||
====
|
||||
Variations in {product-title} configuration, workload size, and workload characteristics require additional analysis to determine the effect on the number of required CPUs for the OpenShift platform.
|
||||
====
|
||||
|
||||
|
||||
@@ -38,7 +38,9 @@ Review the source code for more details:
|
||||
* Clusters with single-stack IP configuration are not validated.
|
||||
* The `reachabilityTotalTimeoutSeconds` parameter in the `Network` CR configures the `EgressIP` node reachability check total timeout in seconds.
|
||||
The recommended value is `1` second.
|
||||
* Pod-level SR-IOV bonding mode must be set to `active-backup` and a `miimon` value must be set (`100` is recommended); a sketch follows this list.
|
||||
|
||||
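A hedged sketch of a pod-level bond attachment that satisfies this requirement follows; the attachment name, namespace, and member interfaces are hypothetical:

[source,yaml]
----
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: bond-net1 # hypothetical bonded attachment
  namespace: example-cnf # hypothetical namespace
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "bond-net1",
      "type": "bond",
      "mode": "active-backup",
      "miimon": "100",
      "links": [ { "name": "net1" }, { "name": "net2" } ],
      "ipam": { "type": "static" }
    }
----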
Engineering considerations::
|
||||
* Pod egress traffic is handled by the kernel routing table when the `routingViaHost` option is enabled, as shown in the sketch after this list.
Appropriate static routes must be configured in the host.
|
||||
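A hedged sketch of the cluster `Network` operator CR showing both the `routingViaHost` option and the `reachabilityTotalTimeoutSeconds` value recommended in the limits above:

[source,yaml]
----
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      egressIPConfig:
        reachabilityTotalTimeoutSeconds: 1 # recommended EgressIP reachability timeout
      gatewayConfig:
        routingViaHost: true # pod egress handled by the kernel routing table
----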
|
||||
|
||||
@@ -17,7 +17,7 @@ Telco core clusters conform to the following requirements:
|
||||
* Multiple machine config pools
|
||||
|
||||
Storage::
|
||||
Telco core use cases require persistent storage as provided by {rh-storage-first}.
|
||||
Telco core use cases require persistent storage as provided by {rh-storage}.
|
||||
|
||||
Networking::
|
||||
Telco core cluster networking conforms to the following requirements:
|
||||
@@ -45,3 +45,4 @@ Service Mesh::
|
||||
Telco CNFs can use Service Mesh.
|
||||
All telco core clusters require a Service Mesh implementation.
|
||||
The choice of implementation and configuration is outside the scope of this specification.
|
||||
|
||||
|
||||
@@ -7,7 +7,7 @@
|
||||
= CPU partitioning and performance tuning
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* No reference design updates in this release.
|
||||
|
||||
Description::
|
||||
CPU partitioning improves performance and reduces latency by separating sensitive workloads from general-purpose tasks, interrupts, and driver work queues.
|
||||
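The partitioning is expressed in the `PerformanceProfile` CR; the following is a minimal sketch with hypothetical CPU ranges, not the shipped reference configuration:

[source,yaml]
----
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile # hypothetical name
spec:
  cpu:
    reserved: "0-1,32-33" # example housekeeping CPUs
    isolated: "2-31,34-63" # example CPUs for latency-sensitive workloads
  globallyDisableIrqLoadBalancing: false
  nodeSelector:
    node-role.kubernetes.io/worker: ""
----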
@@ -24,10 +24,8 @@ Limits and requirements::
|
||||
For more information, see "Creating a performance profile".
|
||||
|
||||
Engineering considerations::
|
||||
|
||||
include::snippets/nodes-cgroup-vi-removed.adoc[]
|
||||
|
||||
* The minimum reserved capacity (`systemReserved`) required can be found by following the guidance in the link:https://access.redhat.com/solutions/5843241[Which amount of CPU and memory are recommended to reserve for the system in OpenShift 4 nodes?] Knowledgebase article.
|
||||
* As of {product-title} 4.19, `cgroup v1` is no longer supported and has been removed. All workloads must now be compatible with `cgroup v2`. For more information, see link:https://www.redhat.com/en/blog/rhel-9-changes-context-red-hat-openshift-workloads[Red Hat Enterprise Linux 9 changes in the context of Red Hat OpenShift workloads](Red Hat Knowledgebase).
|
||||
* The minimum reserved capacity (`systemReserved`) required can be found by following the guidance in the Red Hat Knowledgebase solution link:https://access.redhat.com/solutions/5843241[Which amount of CPU and memory are recommended to reserve for the system in OCP 4 nodes?]
|
||||
* The actual required reserved CPU capacity depends on the cluster configuration and workload attributes.
|
||||
* The reserved CPU value must be rounded up to a full core (2 hyper-threads) alignment.
|
||||
* Changes to CPU partitioning cause the nodes contained in the relevant machine config pool to be drained and rebooted.
|
||||
@@ -44,8 +42,10 @@ You do not need to reserve an additional CPU for handling high network throughpu
|
||||
* If workloads running on the cluster use kernel level networking, the RX/TX queue count for the participating NICs should be set to 16 or 32 queues if the hardware permits it.
|
||||
Be aware of the default queue count.
|
||||
With no configuration, the default queue count is one RX/TX queue per online CPU, which can result in too many interrupts being allocated.
|
||||
* The irdma kernel module can result in the allocation of too many interrupt vectors on systems with high core counts. To prevent this condition, the reference configuration excludes this kernel module from loading through a kernel command-line argument in the `PerformanceProfile` CR, as shown in the sketch after this list. Typically, core workloads do not require this kernel module.
|
||||
+
|
||||
[NOTE]
|
||||
====
|
||||
Some drivers do not deallocate the interrupts even after reducing the queue count.
|
||||
====
|
||||
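One way to express that exclusion is through the `additionalKernelArgs` field of the `PerformanceProfile` CR; this is a sketch, and the exact arguments in the shipped reference CR may differ:

[source,yaml]
----
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-performance-profile # hypothetical name
spec:
  additionalKernelArgs:
    - "module_blacklist=irdma" # prevent the irdma kernel module from loading
----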
|
||||
|
||||
25
modules/telco-core-crs-cluster-infrastructure.adoc
Normal file
@@ -0,0 +1,25 @@
|
||||
// Module included in the following assemblies:
|
||||
//
|
||||
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="cluster-infrastructure-crs_{context}"]
|
||||
= Cluster infrastructure reference CRs
|
||||
|
||||
.Cluster infrastructure CRs
|
||||
[cols="4*", options="header", format=csv]
|
||||
|====
|
||||
Component,Reference CR,Description,Optional
|
||||
Cluster logging,`ClusterLogForwarder.yaml`,Configures a log forwarding instance with the specified service account and verifies that the configuration is valid.,Yes
|
||||
Cluster logging,`ClusterLogNS.yaml`,Configures the cluster logging namespace.,Yes
|
||||
Cluster logging,`ClusterLogOperGroup.yaml`,"Creates the Operator group in the openshift-logging namespace, allowing the Cluster Logging Operator to watch and manage resources.",Yes
|
||||
Cluster logging,`ClusterLogServiceAccount.yaml`,Configures the cluster logging service account.,Yes
|
||||
Cluster logging,`ClusterLogServiceAccountAuditBinding.yaml`,Grants the collect-audit-logs cluster role to the logs collector service account.,Yes
|
||||
Cluster logging,`ClusterLogServiceAccountInfrastructureBinding.yaml`,Allows the collector service account to collect logs from infrastructure resources.,Yes
|
||||
Cluster logging,`ClusterLogSubscription.yaml`,Creates a subscription resource for the Cluster Logging Operator with manual approval for install plans.,Yes
|
||||
Disconnected configuration,`catalog-source.yaml`,Defines a disconnected Red Hat Operators catalog.,No
|
||||
Disconnected configuration,`idms.yaml`,Defines a list of mirrored repository digests for the disconnected registry.,No
|
||||
Disconnected configuration,`operator-hub.yaml`,Defines an OperatorHub configuration which disables all default sources.,No
|
||||
Monitoring and observability,`monitoring-config-cm.yaml`,Configures storage and retention for Prometheus and Alertmanager.,Yes
|
||||
Power management,`PerformanceProfile.yaml`,"Defines a performance profile resource, specifying CPU isolation, hugepages configuration, and workload hints for performance optimization on selected nodes.",No
|
||||
|====
|
||||
@@ -2,6 +2,7 @@
|
||||
//
|
||||
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
|
||||
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="networking-crs_{context}"]
|
||||
= Networking reference CRs
|
||||
@@ -20,7 +21,7 @@ Load Balancer,`community.yaml`,"Defines a MetalLB community, which groups one or
|
||||
Load Balancer,`metallb.yaml`,Defines the MetalLB resource in the cluster.,No
|
||||
Load Balancer,`metallbNS.yaml`,Defines the metallb-system namespace in the cluster.,No
|
||||
Load Balancer,`metallbOperGroup.yaml`,Defines the Operator group for the MetalLB Operator.,No
|
||||
Load Balancer,`metallbSubscription.yaml`,Creates a subscription resource for the metallb Operator with manual approval for install plans.,No
|
||||
Load Balancer,`metallbSubscription.yaml`,Creates a subscription resource for the MetalLB Operator with manual approval for install plans.,No
|
||||
Multus - Tap CNI for rootless DPDK pods,`mc_rootless_pods_selinux.yaml`,Configures a MachineConfig resource which sets an SELinux boolean for the tap CNI plugin on worker nodes.,Yes
|
||||
NMState Operator,`NMState.yaml`,Defines an NMState resource that is used by the NMState Operator to manage node network configurations.,No
|
||||
NMState Operator,`NMStateNS.yaml`,Creates the NMState Operator namespace.,No
|
||||
|
||||
@@ -2,6 +2,10 @@
|
||||
//
|
||||
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
|
||||
|
||||
// Module included in the following assemblies:
|
||||
//
|
||||
// *
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="node-configuration-crs_{context}"]
|
||||
= Node configuration reference CRs
|
||||
|
||||
@@ -10,5 +10,5 @@
|
||||
[cols="4*", options="header", format=csv]
|
||||
|====
|
||||
Component,Reference CR,Description,Optional
|
||||
System reserved capacity,`control-plane-system-reserved.yaml`,"Optional. Configures kubelet, enabling auto-sizing reserved resources for the control plane node pool.",No
|
||||
System reserved capacity,`control-plane-system-reserved.yaml`,"Optional. Configures kubelet, enabling auto-sizing reserved resources for the control plane node pool.",Yes
|
||||
|====
|
||||
|
||||
@@ -2,6 +2,10 @@
|
||||
//
|
||||
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
|
||||
|
||||
// Module included in the following assemblies:
|
||||
//
|
||||
// *
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="scheduling-crs_{context}"]
|
||||
= Scheduling reference CRs
|
||||
|
||||
@@ -13,6 +13,5 @@ Component,Reference CR,Description,Optional
|
||||
External ODF configuration,`01-rook-ceph-external-cluster-details.secret.yaml`,Defines a Secret resource containing base64-encoded configuration data for an external Ceph cluster in the openshift-storage namespace.,No
|
||||
External ODF configuration,`02-ocs-external-storagecluster.yaml`,Defines an OpenShift Container Storage (OCS) storage resource which configures the cluster to use an external storage back end.,No
|
||||
External ODF configuration,`odfNS.yaml`,Creates the monitored openshift-storage namespace for the OpenShift Data Foundation Operator.,No
|
||||
External ODF configuration,`odfOperGroup.yaml`,"Creates the Operator group in the openshift-storage namespace, allowing the OpenShift Data Foundation Operator to watch and manage resources.",No
|
||||
External ODF configuration,`odfSubscription.yaml`,"Creates the subscription for the OpenShift Data Foundation Operator in the openshift-storage namespace.",No
|
||||
External ODF configuration,`odfOperGroup.yaml`,"Creates the Operator group in the openshift-storage namespace, allowing the {rh-storage} Operator to watch and manage resources.",No
|
||||
|====
|
||||
|
||||
43
modules/telco-core-deployment-planning.adoc
Normal file
@@ -0,0 +1,43 @@
|
||||
// Module included in the following assemblies:
|
||||
//
|
||||
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="telco-core-deployment-planning_{context}"]
|
||||
= Deployment planning
|
||||
|
||||
*Worker nodes and machine config pools*
|
||||
|
||||
`MachineConfigPool` (MCP) custom resources (CRs) enable the subdivision of worker nodes in telco core clusters into different node groups based on customer planning parameters.
|
||||
Careful deployment planning using MCPs is crucial to minimize deployment and upgrade time and, more importantly, to minimize interruption of telco-grade services during cluster upgrades.
|
||||
|
||||
*Description*
|
||||
|
||||
Telco core clusters can use MCPs to split worker nodes into additional separate roles, for example, due to different hardware profiles. This allows custom tuning for each role and also plays a critical part in speeding up a telco core cluster deployment or upgrade. More importantly, multiple MCPs allow you to properly plan cluster upgrades across one or many maintenance windows. This is crucial because telco-grade services can otherwise be affected when upgrades are not carefully planned.
|
||||
|
||||
During cluster upgrades, you can pause MCPs while you upgrade the control plane. See "Performing a canary rollout update" for more information. This ensures that worker nodes are not rebooted and running workloads remain unaffected until the MCP is unpaused.
|
||||
|
||||
Using careful MCP planning, you can control the timing and order of which set of nodes are upgraded at any time. For more information on how to use MCPs to plan telco upgrades, see "Applying MachineConfigPool labels to nodes before the update".
|
||||
|
||||
Before beginning the initial deployment, keep the following engineering considerations in mind regarding MCPs:
|
||||
|
||||
When using `PerformanceProfile` definitions, remember that each MCP must be linked to exactly one `PerformanceProfile` definition or tuned profile definition.
|
||||
Consequently, even if the desired configuration is identical for multiple MCPs, each MCP still requires its own dedicated `PerformanceProfile` definition.
|
||||
|
||||
Plan your MCP labeling with an appropriate strategy to split your worker nodes depending on considerations such as:
|
||||
|
||||
* The worker node type: identifying a group of nodes with equivalent hardware profile, for example, workers for control plane Network Functions (NFs) and workers for user data plane NFs.
|
||||
* The number of worker nodes per worker node type.
|
||||
* The minimum number of MCPs required for an equivalent hardware profile is 1, but could be larger for larger clusters.
|
||||
For example, you may design for more MCPs per hardware profile to support a more granular upgrade where a smaller percentage of the cluster capacity is affected with each step.
|
||||
* The strategy for performing updates on nodes within an MCP is shaped by upgrade requirements and the chosen `maxUnavailable` value:
|
||||
** Number of maintenance windows allowed.
|
||||
** Duration of a maintenance window.
|
||||
** Total number of worker nodes.
|
||||
** Desired `maxUnavailable` (number of nodes updated concurrently) for the MCP.
|
||||
* CNF requirements for worker nodes, in terms of:
|
||||
** Minimum availability per Pod required during an upgrade, configured with a pod disruption budget (PDB); a sketch follows this list. PDBs are crucial to maintaining telco service-level agreements (SLAs) during upgrades. For more information about PDBs, see "Understanding how to use pod disruption budgets to specify the number of pods that must be up".
|
||||
** Minimum true high availability required per Pod, such that each replica runs on separate hardware.
|
||||
** Pod affinity and anti-affinity: for more information about how to use pod affinity and anti-affinity, see "Placing pods relative to other pods using affinity and anti-affinity rules".
|
||||
* Duration and frequency of upgrade maintenance windows during which telco-grade services may be affected.
|
||||
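A hedged sketch of a pod disruption budget for a CNF deployment, with hypothetical names and replica counts:

[source,yaml]
----
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-upf-pdb # hypothetical PDB
  namespace: example-cnf # hypothetical namespace
spec:
  minAvailable: 2 # keep at least 2 replicas running while nodes drain
  selector:
    matchLabels:
      app: example-upf # hypothetical workload label
----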
|
||||
8
modules/telco-core-deployment.adoc
Normal file
@@ -0,0 +1,8 @@
|
||||
// Module included in the following assemblies:
|
||||
//
|
||||
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="telco-core-deployment_{context}"]
|
||||
= Deployment
|
||||
|
||||
@@ -7,7 +7,7 @@
|
||||
= Disconnected environment
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* No reference design updates in this release.
|
||||
|
||||
Description::
|
||||
Telco core clusters are expected to be installed in networks without direct access to the internet.
|
||||
@@ -24,3 +24,4 @@ Do not reuse the default catalog names.
|
||||
|
||||
Engineering considerations::
|
||||
* A valid time source must be configured as part of cluster installation.
|
||||
|
||||
|
||||
@@ -4,10 +4,10 @@
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="telco-core-gitops-operator-and-ztp-plugins_{context}"]
|
||||
= GitOps Operator and GitOps ZTP plugins
|
||||
= GitOps Operator and ZTP plugins
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* No reference design updates in this release.
|
||||
|
||||
Description::
|
||||
+
|
||||
@@ -21,19 +21,20 @@ The SiteConfig Operator provides improved support for generation of `Installatio
|
||||
|
||||
[IMPORTANT]
|
||||
====
|
||||
Where possible, use `ClusterInstance` CRs for cluster installation instead of the `SiteConfig` with {ztp} plugin method.
|
||||
Using `ClusterInstance` CRs for cluster installation is preferred over the `SiteConfig` custom resource with ZTP plugin method.
|
||||
====
|
||||
|
||||
You should structure the Git repository according to release version, with all necessary artifacts (`SiteConfig`, `ClusterInstance`, `PolicyGenerator`, and `PolicyGenTemplate` CRs, and supporting reference CRs) included.
|
||||
This enables deploying and managing multiple versions of the OpenShift platform and configuration versions to clusters simultaneously and through upgrades.
|
||||
This enables deploying and managing multiple versions of {product-title} and configuration versions to clusters simultaneously and through upgrades.
|
||||
|
||||
The recommended Git structure keeps reference CRs in a directory separate from customer or partner provided content.
|
||||
This means that you can import reference updates by simply overwriting existing content.
|
||||
Customer or partner-supplied CRs can be provided in a parallel directory to the reference CRs for easy inclusion in the generated configuration policies.
|
||||
Customer or partner supplied CRs can be provided in a parallel directory to the reference CRs for easy inclusion in the generated configuration policies.
|
||||
--
|
||||
|
||||
Limits and requirements::
|
||||
* Each ArgoCD application supports up to 300 nodes.
|
||||
// Scale results ACM-17868
|
||||
* Each ArgoCD application supports up to 800 nodes.
|
||||
Multiple ArgoCD applications can be used to achieve the maximum number of clusters supported by a single hub cluster.
|
||||
* The `SiteConfig` CR must use the `extraManifests.searchPaths` field to reference the reference manifests.
|
||||
+
|
||||
@@ -43,15 +44,16 @@ Since {product-title} 4.15, the `spec.extraManifestPath` field is deprecated.
|
||||
====
|
||||
|
||||
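A hedged `SiteConfig` fragment illustrating the `extraManifests.searchPaths` field; the cluster name and directory paths are hypothetical:

[source,yaml]
----
apiVersion: ran.openshift.io/v1
kind: SiteConfig
metadata:
  name: example-site # hypothetical site
  namespace: example-site
spec:
  clusters:
    - clusterName: example-cluster # hypothetical cluster
      extraManifests:
        searchPaths:
          - extra-manifest/ # hypothetical directories in the Git repository
          - custom-manifests/
----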
Engineering considerations::
|
||||
* Set the `MachineConfigPool` (`mcp`) CR `paused` field to true during a cluster upgrade maintenance window and set the `maxUnavailable` field to the maximum tolerable value.
|
||||
* Set the `MachineConfigPool` (MCP) CR `paused` field to `true` during a cluster upgrade maintenance window and set the `maxUnavailable` field to the maximum tolerable value; a sketch follows this list.
|
||||
This prevents multiple cluster node reboots during upgrade, which results in a shorter overall upgrade.
|
||||
When you unpause the `mcp` CR, all the configuration changes are applied with a single reboot.
|
||||
+
|
||||
[NOTE]
|
||||
====
|
||||
During installation, custom `mcp` CRs can be paused along with setting `maxUnavailable` to 100% to improve installation times.
|
||||
During installation, custom `MCP` CRs can be paused along with setting `maxUnavailable` to 100% to improve installation times.
|
||||
====
|
||||
|
||||
* To avoid confusion or unintentional overwriting when updating content, you should use unique and distinguishable names for custom CRs in the `reference-crs/` directory under core-overlay and extra manifests in Git.
|
||||
* To avoid confusion or unintentional overwriting when updating content, you should use unique and distinguishable names for custom CRs in the `reference-crs/` directory under core-overlay and extra manifests in git.
|
||||
* The `SiteConfig` CR allows multiple extra-manifest paths.
|
||||
When file names overlap in multiple directory paths, the last file found in the directory order list takes precedence.
|
||||
|
||||
|
||||
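The `MachineConfigPool` pause guidance earlier in this list might be sketched as follows; the pool name and `maxUnavailable` value are hypothetical:

[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker-1 # hypothetical custom pool
spec:
  paused: true # pause during the upgrade maintenance window
  maxUnavailable: 2 # maximum tolerable number of nodes updating at once
----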
@@ -7,7 +7,7 @@
|
||||
= Host firmware and boot loader configuration
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* No reference design updates in this release.
|
||||
|
||||
Engineering considerations::
|
||||
// https://issues.redhat.com/browse/CNF-11806
|
||||
@@ -18,3 +18,4 @@ Engineering considerations::
|
||||
When secure boot is enabled, only signed kernel modules are loaded by the kernel.
|
||||
Out-of-tree drivers are not supported.
|
||||
====
|
||||
|
||||
|
||||
8
modules/telco-core-installation.adoc
Normal file
@@ -0,0 +1,8 @@
|
||||
// Module included in the following assemblies:
|
||||
//
|
||||
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="telco-core-installation_{context}"]
|
||||
= Installation
|
||||
|
||||
@@ -7,9 +7,8 @@
|
||||
= Load balancer
|
||||
|
||||
New in this release::
|
||||
// https://issues.redhat.com/browse/CNF-14150
|
||||
* FRR-K8s is now available under the Cluster Network Operator.
|
||||
+
|
||||
* No reference design updates in this release.
|
||||
|
||||
[IMPORTANT]
|
||||
====
|
||||
If you have custom `FRRConfiguration` CRs in the `metallb-system` namespace, you must move them under the `openshift-network-operator` namespace.
|
||||
@@ -38,3 +37,4 @@ See `routingViaHost` in "Cluster Network Operator".
|
||||
** MetalLB uses BGP for announcing routes only.
|
||||
Only the `transmitInterval` and `minimumTtl` parameters are relevant in this mode.
|
||||
Other parameters in the BFD profile should remain close to the defaults as shorter values can lead to false negatives and affect performance.
|
||||
|
||||
|
||||
@@ -10,8 +10,7 @@ New in this release::
|
||||
* No reference design updates in this release
|
||||
|
||||
Description::
|
||||
The Cluster Logging Operator enables collection and shipping of logs off the node for remote archival and analysis.
|
||||
The reference configuration uses Kafka to ship audit and infrastructure logs to a remote archive.
|
||||
The Cluster Logging Operator enables collection and shipping of logs off the node for remote archival and analysis. The reference configuration uses Kafka to ship audit and infrastructure logs to a remote archive.
|
||||
|
||||
Limits and requirements::
|
||||
Not applicable
|
||||
@@ -20,3 +19,4 @@ Engineering considerations::
|
||||
* The impact of cluster CPU use is based on the number or size of logs generated and the amount of log filtering configured.
|
||||
* The reference configuration does not include shipping of application logs.
|
||||
The inclusion of application logs in the configuration requires you to evaluate the application logging rate and have sufficient additional CPU resources allocated to the reserved set.
|
||||
|
||||
|
||||
@@ -7,17 +7,20 @@
|
||||
= Monitoring
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* No reference design updates in this release.
|
||||
|
||||
Description::
|
||||
+
|
||||
--
|
||||
The Cluster Monitoring Operator (CMO) is included by default in {product-title} and provides monitoring (metrics, dashboards, and alerting) for the platform components and optionally user projects.
|
||||
|
||||
You can customize the default log retention period, custom alert rules, and so on.
|
||||
|
||||
The default handling of pod CPU and memory metrics, based on upstream Kubernetes and cAdvisor, makes a tradeoff favoring stale data over metric accuracy.
|
||||
This leads to spikes in reporting, which can create false alerts, depending on the user-specified thresholds.
|
||||
{product-title} supports an opt-in Dedicated Service Monitor feature that creates an additional set of pod CPU and memory metrics that do not suffer from this behavior.
|
||||
For more information, see link:https://access.redhat.com/solutions/7012719[Dedicated Service Monitors - Questions and Answers (Red Hat Knowledgebase)].
|
||||
|
||||
{product-title} supports an opt-in Dedicated Service Monitor feature that creates an additional set of pod CPU and memory metrics which do not suffer from this behavior.
|
||||
For more information, see the Red Hat Knowledgebase solution link:https://access.redhat.com/solutions/7012719[Dedicated Service Monitors - Questions and Answers].
|
||||
|
||||
In addition to the default configuration, the following metrics are expected to be configured for telco core clusters:
|
||||
|
||||
@@ -25,9 +28,10 @@ In addition to the default configuration, the following metrics are expected to
|
||||
--
|
||||
|
||||
Limits and requirements::
|
||||
* You must enable the Dedicated Service Monitor feature to represent pod metrics accurately.
|
||||
* You must enable the Dedicated Service Monitor feature for accurate representation of pod metrics.
|
||||
|
||||
Engineering considerations::
|
||||
* The Prometheus retention period is specified by the user.
|
||||
The value used is a tradeoff between operational requirements for maintaining historical data on the cluster against CPU and storage resources.
|
||||
Longer retention periods increase the need for storage and require additional CPU to manage data indexing.
|
||||
Longer retention periods increase the need for storage and require additional CPU to manage the indexing of data.
|
||||
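A hedged sketch of the retention setting in the cluster monitoring ConfigMap; the retention period and storage size are illustrative values only:

[source,yaml]
----
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d # example retention period, tune to operational requirements
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi # example persistent storage sizing
----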
|
||||
|
||||
@@ -11,26 +11,15 @@ The following diagram describes the telco core reference design networking confi
|
||||
.Telco core reference design networking configuration
|
||||
image::openshift-telco-core-rds-networking.png[Overview of the telco core reference design networking configuration]
|
||||
|
||||
|
||||
New in this release::
|
||||
+
|
||||
--
|
||||
// https://issues.redhat.com/browse/CNF-12678
|
||||
* Support for disabling vendor plugins in the SR-IOV Operator
|
||||
* Extend telco core validation with pod-level bonding.
|
||||
* Support moving failed policy in resource injector to failed for SR-IOV operator.
|
||||
|
||||
// https://issues.redhat.com/browse/CNF-13768
|
||||
* link:https://access.redhat.com/articles/7090422[New knowledge base article on creating custom node firewall rules]
|
||||
|
||||
// https://issues.redhat.com/browse/CNF-13981
|
||||
* Extended telco core RDS validation with MetalLB and EgressIP telco QE validation
|
||||
|
||||
// https://issues.redhat.com/browse/CNF-14150
|
||||
* FRR-K8s is now available under the Cluster Network Operator.
|
||||
+
|
||||
[NOTE]
|
||||
====
|
||||
If you have custom `FRRConfiguration` CRs in the `metallb-system` namespace, you must move them under the `openshift-network-operator` namespace.
|
||||
====
|
||||
--
|
||||
|
||||
Description::
|
||||
+
|
||||
@@ -52,10 +41,10 @@ For more information, see "Cluster Network Operator".
|
||||
.. Configure VLAN interfaces and specific kernel IP routes on the nodes using `NodeNetworkConfigurationPolicy` CRs.
|
||||
.. Create a MetalLB `BGPPeer` CR for each VLAN to establish peering with the remote BGP router.
|
||||
.. Define a MetalLB `BGPAdvertisement` CR to specify which IP address pools should be advertised to a selected list of `BGPPeer` resources.
|
||||
+
|
||||
The following diagram illustrates how specific service IP addresses are advertised to the outside via specific VLAN interfaces.
|
||||
Service routes are defined in `BGPAdvertisement` CRs and configured with values for the `IPAddressPool1` and `BGPPeer1` fields.
|
||||
--
|
||||
|
||||
.Telco core reference design MetalLB service separation
|
||||
image::openshift-telco-core-rds-metallb-service-separation.png[Telco core reference design MetalLB service separation]
|
||||
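The per-VLAN service separation shown in the diagram might be expressed with a `BGPAdvertisement` CR similar to the following sketch; all resource names are hypothetical:

[source,yaml]
----
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: bgpadvertisement-vlan1 # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - ip-address-pool-1 # hypothetical pool advertised only on this VLAN
  peers:
    - bgp-peer-vlan1 # hypothetical BGPPeer reachable over the VLAN interface
----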
|
||||
|
||||
@@ -21,3 +21,4 @@ Engineering considerations::
|
||||
* Initial networking configuration is applied using `NMStateConfig` content in the installation CRs.
|
||||
The NMState Operator is used only when required for network updates.
|
||||
* When SR-IOV virtual functions are used for host networking, the NMState Operator (via `nodeNetworkConfigurationPolicy` CRs) is used to configure VF interfaces, such as VLANs and MTU.
|
||||
|
||||
|
||||
@@ -7,12 +7,11 @@
|
||||
= Node Configuration
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* No reference design updates in this release.
|
||||
|
||||
Limits and requirements::
|
||||
* Analyze additional kernel modules to determine impact on CPU load, system performance, and ability to meet KPIs.
|
||||
+
|
||||
--
|
||||
.Additional kernel modules
|
||||
|====
|
||||
|Feature|Description
|
||||
@@ -39,4 +38,3 @@ a|Install the following kernel modules by using `MachineConfig` CRs to provide e
|
||||
Creates a container mount namespace, visible to kubelet/CRI-O, to reduce system mount scanning overhead.
|
||||
|Kdump enable|Optional configuration (enabled by default)
|
||||
|====
|
||||
--
|
||||
|
||||
@@ -4,19 +4,34 @@
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="telco-core-openshift-data-foundation_{context}"]
|
||||
= Red Hat OpenShift Data Foundation
|
||||
= {rh-storage}
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* Clarification on internal compared to external mode and RDS recommendations.
|
||||
|
||||
Description::
|
||||
{rh-storage-first} is a software-defined storage service for containers.
|
||||
For telco core clusters, storage support is provided by {rh-storage} storage services running externally to the application workload cluster.
|
||||
+
|
||||
--
|
||||
{rh-storage} is a software-defined storage service for containers.
|
||||
{rh-storage} can be deployed in one of two modes:
|
||||
* Internal mode, where {rh-storage} software components are deployed as software containers directly on the {product-title} cluster nodes, together with other containerized applications.
|
||||
* External mode, where {rh-storage} is deployed on a dedicated storage cluster, which is usually a separate Red Hat Ceph Storage cluster running on {op-system-base-full}.
|
||||
These storage services are running externally to the application workload cluster.
|
||||
|
||||
For telco core clusters, storage support is provided by {rh-storage} storage services running in external mode, for several reasons:
|
||||
|
||||
* Separating dependencies between {product-title} and Ceph operations allows for independent {product-title} and {rh-storage} updates.
|
||||
* Separation of operations functions for the Storage and {product-title} infrastructure layers is a typical customer requirement for telco core use cases.
|
||||
* External Red Hat Ceph Storage clusters can be re-used by multiple {product-title} clusters deployed in the same region.
|
||||
|
||||
{rh-storage} supports separation of storage traffic using secondary CNI networks.
|
||||
--
|
||||
|
||||
Limits and requirements::
|
||||
* In an IPv4/IPv6 dual-stack networking environment, {rh-storage} uses IPv4 addressing.
|
||||
For more information, see link:https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.17/html/planning_your_deployment/network-requirements_rhodf#network-requirements_rhodf[Network requirements].
|
||||
For more information, see link:https://docs.redhat.com/en/documentation/red_hat_openshift_data_foundation/4.19/html/planning_your_deployment/network-requirements_rhodf#ipv6-support_rhodf[IPv6 support].
|
||||
|
||||
Engineering considerations::
|
||||
* {rh-storage} network traffic should be isolated from other traffic on a dedicated network, for example, by using VLAN isolation.
|
||||
* Workload requirements must be scoped before attaching multiple {product-title} clusters to an external {rh-storage} cluster to ensure sufficient throughput, bandwidth, and performance KPIs.
|
||||
|
||||
|
||||
@@ -10,7 +10,7 @@ New in this release::
|
||||
* No reference design updates in this release
|
||||
|
||||
Description::
|
||||
Use the Performance Profile to configure clusters with high power mode, low power mode, or mixed mode.
|
||||
Use the Performance profile to configure clusters with high power mode, low power mode, or mixed mode.
|
||||
The choice of power mode depends on the characteristics of the workloads running on the cluster, particularly how sensitive they are to latency.
|
||||
Configure the maximum latency for a low-latency pod by using the per-pod power management C-states feature.
|
||||
|
||||
@@ -21,3 +21,4 @@ Configuration varies between hardware vendors.
|
||||
Engineering considerations::
|
||||
* Latency: To ensure that latency-sensitive workloads meet requirements, you require a high-power or a per-pod power management configuration.
|
||||
Per-pod power management is only available for Guaranteed QoS pods with dedicated pinned CPUs.
|
||||
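A hedged sketch of per-pod power management annotations on a Guaranteed QoS pod; annotation values and the runtime class name depend on the node's `PerformanceProfile` and are illustrative here:

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: low-latency-pod # hypothetical Guaranteed QoS pod
  annotations:
    cpu-c-states.crio.io: "max_latency:10" # example maximum C-state exit latency in microseconds
    cpu-quota.crio.io: "disable"
    irq-load-balancing.crio.io: "disable"
spec:
  runtimeClassName: performance-openshift-node-performance-profile # hypothetical, derived from the PerformanceProfile name
  containers:
    - name: app
      image: registry.example.com/cnf/app:latest # hypothetical image
      resources:
        requests:
          cpu: "4"
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 4Gi
----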
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
// Module included in the following assemblies:
|
||||
//
|
||||
// * scalability_and_performance/telco_ref_design_specs/core/telco-core-ref-crs.adoc
|
||||
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
|
||||
|
||||
:_mod-docs-content-type: PROCEDURE
|
||||
[id="telco-core-rds-container_{context}"]
|
||||
@@ -15,7 +15,14 @@ The container image has both the required CRs, and the optional CRs, for the tel
|
||||
|
||||
.Procedure
|
||||
|
||||
* Extract the content from the `telco-core-rds-rhel9` container image by running the following commands:
|
||||
. Log on to the container image registry with your credentials by running the following command:
|
||||
+
|
||||
[source,terminal]
|
||||
----
|
||||
$ podman login registry.redhat.io
|
||||
----
|
||||
|
||||
. Extract the content from the `telco-core-rds-rhel9` container image by running the following commands:
|
||||
+
|
||||
[source,terminal]
|
||||
----
|
||||
@@ -24,36 +31,85 @@ $ mkdir -p ./out
|
||||
+
|
||||
[source,terminal]
|
||||
----
|
||||
$ podman run -it registry.redhat.io/openshift4/openshift-telco-core-rds-rhel9:v4.18 | base64 -d | tar xv -C out
|
||||
$ podman run -it registry.redhat.io/openshift4/openshift-telco-core-rds-rhel9:v4.19 | base64 -d | tar xv -C out
|
||||
----
|
||||
|
||||
.Verification
|
||||
|
||||
* The `out` directory has the following directory structure. You can view the telco core CRs in the `out/telco-core-rds/` directory.
|
||||
* The `out` directory has the following directory structure. You can view the telco core CRs in the `out/telco-core-rds/` directory by running the following command:
|
||||
+
|
||||
[source,terminal]
|
||||
----
|
||||
$ tree -L 4
|
||||
----
|
||||
+
|
||||
.Example output
|
||||
[source,text]
|
||||
----
|
||||
out/
|
||||
└── telco-core-rds
|
||||
├── configuration
|
||||
│ └── reference-crs
|
||||
│ ├── optional
|
||||
│ │ ├── logging
|
||||
│ │ ├── networking
|
||||
│ │ │ └── multus
|
||||
│ │ │ └── tap_cni
|
||||
│ │ ├── other
|
||||
│ │ └── tuning
|
||||
│ └── required
|
||||
│ ├── networking
|
||||
│ │ ├── metallb
|
||||
│ │ ├── multinetworkpolicy
|
||||
│ │ └── sriov
|
||||
│ ├── other
|
||||
│ ├── performance
|
||||
│ ├── scheduling
|
||||
│ └── storage
|
||||
│ └── odf-external
|
||||
└── install
|
||||
.
|
||||
├── configuration
|
||||
│ ├── compare.sh
|
||||
│ ├── core-baseline.yaml
|
||||
│ ├── core-finish.yaml
|
||||
│ ├── core-overlay.yaml
|
||||
│ ├── core-upgrade.yaml
|
||||
│ ├── kustomization.yaml
|
||||
│ ├── Makefile
|
||||
│ ├── ns.yaml
|
||||
│ ├── README.md
|
||||
│ ├── reference-crs
|
||||
│ │ ├── custom-manifests
|
||||
│ │ │ ├── mcp-worker-1.yaml
|
||||
│ │ │ ├── mcp-worker-2.yaml
|
||||
│ │ │ ├── mcp-worker-3.yaml
|
||||
│ │ │ └── README.md
|
||||
│ │ ├── optional
|
||||
│ │ │ ├── logging
|
||||
│ │ │ ├── networking
|
||||
│ │ │ ├── other
|
||||
│ │ │ └── tuning
|
||||
│ │ └── required
|
||||
│ │ ├── networking
|
||||
│ │ ├── other
|
||||
│ │ ├── performance
|
||||
│ │ ├── scheduling
|
||||
│ │ └── storage
|
||||
│ ├── reference-crs-kube-compare
|
||||
│ │ ├── compare_ignore
|
||||
│ │ ├── comparison-overrides.yaml
|
||||
│ │ ├── metadata.yaml
|
||||
│ │ ├── optional
|
||||
│ │ │ ├── logging
|
||||
│ │ │ ├── networking
|
||||
│ │ │ ├── other
|
||||
│ │ │ └── tuning
|
||||
│ │ ├── ReferenceVersionCheck.yaml
|
||||
│ │ ├── required
|
||||
│ │ │ ├── networking
|
||||
│ │ │ ├── other
|
||||
│ │ │ ├── performance
|
||||
│ │ │ ├── scheduling
|
||||
│ │ │ └── storage
|
||||
│ │ ├── unordered_list.tmpl
|
||||
│ │ └── version_match.tmpl
|
||||
│ └── template-values
|
||||
│ ├── hw-types.yaml
|
||||
│ └── regional.yaml
|
||||
├── install
|
||||
│ ├── custom-manifests
|
||||
│ │ ├── mcp-worker-1.yaml
|
||||
│ │ ├── mcp-worker-2.yaml
|
||||
│ │ └── mcp-worker-3.yaml
|
||||
│ ├── example-standard.yaml
|
||||
│ ├── extra-manifests
|
||||
│ │ ├── control-plane-load-kernel-modules.yaml
|
||||
│ │ ├── kdump-master.yaml
|
||||
│ │ ├── kdump-worker.yaml
|
||||
│ │ ├── mc_rootless_pods_selinux.yaml
|
||||
│ │ ├── mount_namespace_config_master.yaml
|
||||
│ │ ├── mount_namespace_config_worker.yaml
|
||||
│ │ ├── sctp_module_mc.yaml
|
||||
│ │ └── worker-load-kernel-modules.yaml
|
||||
│ └── README.md
|
||||
└── README.md
|
||||
----
|
||||
|
||||
@@ -6,6 +6,6 @@
|
||||
[id="telco-core-rds-product-version-use-model-overview_{context}"]
|
||||
= Telco core RDS {product-version} use model overview
|
||||
|
||||
The telco core reference design specifications (RDS) describes a platform that supports large-scale telco applications, including control plane functions such as signaling and aggregation.
|
||||
It also includes some centralized data plane functions, such as user plane functions (UPF).
|
||||
These functions generally require scalability, complex networking support, resilient software-defined storage, and support performance requirements that are less stringent and constrained than far-edge deployments such as RAN.
|
||||
The Telco core reference design specification (RDS) describes a platform that supports large-scale telco applications including control plane functions such as signaling and aggregation.
|
||||
It also includes some centralized data plane functions, for example, user plane functions (UPF).
|
||||
These functions generally require scalability, complex networking support, resilient software-defined storage, and support performance requirements that are less stringent and constrained than far-edge deployments such as RAN.
|
||||
@@ -7,15 +7,16 @@
|
||||
= Red Hat Advanced Cluster Management
|
||||
|
||||
New in this release::
|
||||
* Image-based installation for {sno} clusters is the recommended install methodology.
|
||||
* Using {rh-rhacm} and PolicyGenerator CRs is the recommended approach for managing and deploying policies to managed clusters.
|
||||
This replaces the use of PolicyGenTemplate CRs for this purpose.
|
||||
|
||||
Description::
|
||||
+
|
||||
--
|
||||
{rh-rhacm-first} provides Multi Cluster Engine (MCE) installation and ongoing {ztp} lifecycle management for deployed clusters.
|
||||
{rh-rhacm} provides Multi Cluster Engine (MCE) installation and ongoing {ztp} lifecycle management for deployed clusters.
|
||||
You manage cluster configuration and upgrades declaratively by applying `Policy` custom resources (CRs) to clusters during maintenance windows.
|
||||
|
||||
You apply policies with the {rh-rhacm} policy controller as managed by {cgu-operator-full}.
|
||||
You apply policies with the {rh-rhacm} policy controller as managed by {cgu-operator}.
|
||||
Configuration, upgrades, and cluster status are managed through the policy controller.
|
||||
|
||||
When installing managed clusters, {rh-rhacm} applies labels and initial ignition configuration to individual nodes in support of custom disk partitioning, allocation of roles, and allocation to machine config pools.
|
||||
@@ -24,10 +25,11 @@ You define these configurations with `SiteConfig` or `ClusterInstance` CRs.
|
||||
|
||||
Limits and requirements::
|
||||
|
||||
* Hub cluster sizing is discussed in link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html-single/install/index#sizing-your-cluster[Sizing your cluster].
|
||||
* Hub cluster sizing is discussed in link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/install/index#sizing-your-cluster[Sizing your cluster].
|
||||
|
||||
* {rh-rhacm} scaling limits are described in link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.11/html-single/install/index#performance-and-scalability[Performance and Scalability].
|
||||
* {rh-rhacm} scaling limits are described in link:https://docs.redhat.com/en/documentation/red_hat_advanced_cluster_management_for_kubernetes/2.13/html-single/install/index#performance-and-scalability[Performance and Scalability].
|
||||
|
||||
Engineering considerations::
|
||||
* When managing multiple clusters with unique content per installation, site, or deployment, using {rh-rhacm} hub templating is strongly recommended.
|
||||
{rh-rhacm} hub templating allows you to apply a consistent set of policies to clusters while providing for unique values per installation.
|
||||
|
||||
|
||||
@@ -0,0 +1,11 @@
|
||||
// Module included in the following assemblies:
|
||||
//
|
||||
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc
|
||||
|
||||
:_mod-docs-content-type: REFERENCE
|
||||
[id="telco-core-reference-design-specification-for-product-title-product-version_{context}"]
|
||||
= Telco core reference design specifications
|
||||
|
||||
The telco core reference design specification (RDS) configures an {product-title} cluster running on commodity hardware to host telco core workloads.
|
||||
|
||||
|
||||
@@ -7,10 +7,14 @@
|
||||
= Scalability
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* No reference design updates in this release.
|
||||
|
||||
Description::
|
||||
Scale clusters as described in "Limits and requirements".
|
||||
Scaling of workloads is described in "Application workloads".
|
||||
|
||||
Limits and requirements::
|
||||
* Cluster can scale to at least 120 nodes.
|
||||
|
||||
|
||||
:leveloffset!:
|
||||
|
||||
@@ -7,14 +7,14 @@
|
||||
= Scheduling
|
||||
|
||||
New in this release::
|
||||
* No reference design updates in this release
|
||||
* No reference design updates in this release.
|
||||
|
||||
Description::
|
||||
+
|
||||
--
|
||||
The scheduler is a cluster-wide component responsible for selecting the correct node for a given workload.
|
||||
The scheduler is a cluster-wide component responsible for selecting the right node for a given workload.
|
||||
It is a core part of the platform and does not require any specific configuration in the common deployment scenarios.
|
||||
However, a few specific use cases are described in the following section.
|
||||
However, there are a few specific use cases described in the following section.
|
||||
|
||||
NUMA-aware scheduling can be enabled through the NUMA Resources Operator.
|
||||
For more information, see "Scheduling NUMA-aware workloads".
|
||||
@@ -23,13 +23,13 @@ For more information, see "Scheduling NUMA-aware workloads".
|
||||
Limits and requirements::
|
||||
* The default scheduler does not understand the NUMA locality of workloads.
|
||||
It only knows about the sum of all free resources on a worker node.
|
||||
This might cause workloads to be rejected when scheduled to a node with the topology manager policy set to `single-numa-node` or `restricted`.
|
||||
For more information, see "Topology Manager policies".
|
||||
** For example, consider a pod requesting 6 CPUs that is scheduled to an empty node that has 4 CPUs per NUMA node.
|
||||
This might cause workloads to be rejected when scheduled to a node with the topology manager policy set to single-numa-node or restricted. For more information, see "Topology Manager policies".
|
||||
+
|
||||
For example, consider a pod requesting 6 CPUs and being scheduled to an empty node that has 4 CPUs per NUMA node.
|
||||
The total allocatable capacity of the node is 8 CPUs. The scheduler places the pod on the empty node.
|
||||
The node-local admission fails because only 4 CPUs are available in each of the NUMA nodes.
|
||||
* All clusters with multi-NUMA nodes are required to use the NUMA Resources Operator.
|
||||
See "Installing the NUMA Resources Operator" for more information.
|
||||
|
||||
* All clusters with multi-NUMA nodes are required to use the NUMA Resources Operator. See "Installing the NUMA Resources Operator" for more information.
|
||||
Use the `machineConfigPoolSelector` field in the `KubeletConfig` CR to select all nodes where NUMA aligned scheduling is required.
|
||||
* All machine config pools must have consistent hardware configuration.
|
||||
For example, all nodes are expected to have the same NUMA zone count.
|
||||
@@ -37,4 +37,5 @@ For example, all nodes are expected to have the same NUMA zone count.
|
||||
Engineering considerations::
|
||||
* Pods might require annotations for correct scheduling and isolation.
|
||||
For more information about annotations, see "CPU partitioning and performance tuning".
|
||||
* You can configure SR-IOV virtual function NUMA affinity to be ignored during scheduling by using the `excludeTopology` field in `SriovNetworkNodePolicy` CR.
|
||||
* You can configure SR-IOV virtual function NUMA affinity to be ignored during scheduling by using the `excludeTopology` field in the `SriovNetworkNodePolicy` CR; a sketch follows this list.
|
||||
|
||||
|
||||
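A hedged `SriovNetworkNodePolicy` sketch showing the `excludeTopology` field; device names and VF counts are hypothetical:

[source,yaml]
----
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-net1 # hypothetical policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: net1
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 8
  nicSelector:
    pfNames: ["ens1f0"] # hypothetical physical function
  deviceType: vfio-pci
  excludeTopology: true # ignore VF NUMA affinity during scheduling
----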
@@ -7,54 +7,48 @@
|
||||
= Security

New in this release::
* No reference design updates in this release.

Description::
+
--
Telco customers are security conscious and require clusters to be hardened against multiple attack vectors.
In {product-title}, there is no single component or feature responsible for securing a cluster.
Use the following security-oriented features and configurations to secure your clusters:

* **SecurityContextConstraints (SCC)**: All workload pods should be run with the `restricted-v2` or `restricted` SCC.
* **Seccomp**: All pods should run with the `RuntimeDefault` (or stronger) seccomp profile.
A pod specification that satisfies both constraints is sketched after this list.
* **Rootless DPDK pods**: Many user-plane networking (DPDK) CNFs require pods to run with root privileges.
With this feature, a conformant DPDK pod can run without requiring root privileges.
Rootless DPDK pods create a tap device in a rootless pod that injects traffic from a DPDK application to the kernel.
* **Storage**: The storage network should be isolated and non-routable to other cluster networks.
See the "Storage" section for additional details.
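
The following pod snippet is a minimal sketch of a workload that conforms to the `restricted-v2` SCC and runs with the `RuntimeDefault` seccomp profile.
The pod name, image, and resource values are illustrative assumptions only.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: example-cnf-pod # Illustrative name
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: registry.example.com/cnf/app:latest # Illustrative image
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
    resources:
      requests:
        cpu: "2"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi
----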

See the Red Hat Knowledgebase solution article link:https://access.redhat.com/articles/7090422[Custom nftable firewall rules in OpenShift] for a supported method for implementing custom nftables firewall rules in {product-title} cluster nodes.
This article is intended for cluster administrators who are responsible for managing network security policies in {product-title} environments.

Before deploying this method, carefully consider the following operational implications:

* **Early application**: The rules are applied at boot time, before the network is fully operational.
Ensure that the rules do not inadvertently block essential services required during the boot process.

* **Risk of misconfiguration**: Errors in your custom rules can have unintended consequences, such as degraded performance, blocked legitimate traffic, or isolated nodes.
Thoroughly test your rules in a non-production environment before deploying them to your main cluster.

* **External endpoints**: {product-title} requires access to external endpoints to function.
For more information about the firewall allowlist, see "Configuring your firewall for {product-title}".
Ensure that cluster nodes are permitted access to those endpoints.

* **Node reboot**: Unless node disruption policies are configured, applying the `MachineConfig` CR with the required firewall settings causes a node reboot.
Be aware of this impact and schedule a maintenance window accordingly.
For more information, see "Using node disruption policies to minimize disruption from machine config changes".
A node disruption policy sketch follows this list.
+
[NOTE]
====
Node disruption policies are available in {product-title} 4.17 and later.
====

* **Network flow matrix**: For more information about managing ingress traffic, see "{product-title} network flow matrix".
You can restrict ingress traffic to essential flows to improve network security.
The matrix provides insights into base cluster services but excludes traffic generated by Day-2 Operators.

* **Cluster version updates and upgrades**: Exercise caution when updating or upgrading {product-title} clusters.
Recent changes to the platform's firewall requirements might require adjustments to network port permissions.
Although the documentation provides guidelines, note that these requirements can evolve over time.
To minimize disruptions, test any updates or upgrades in a staging environment before applying them in production.
This helps you to identify and address potential compatibility issues related to firewall configuration changes.
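
The following `MachineConfiguration` snippet is a minimal sketch of a node disruption policy that avoids a reboot when a firewall-related file changes.
The file path and service name are illustrative assumptions; align them with the rules that you actually deploy.

[source,yaml]
----
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  nodeDisruptionPolicy:
    files:
    - path: /etc/sysconfig/nftables.conf # Illustrative path for custom firewall rules
      actions:
      - type: Restart
        restart:
          serviceName: nftables.service # Restart the service instead of rebooting the node
----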
--
Limits and requirements::
* Rootless DPDK pods require the following additional configuration:
** Configure the `container_t` SELinux context for the tap plugin.
** Enable the `container_use_devices` SELinux boolean for the cluster host.

Engineering considerations::
* For rootless DPDK pod support, enable the SELinux `container_use_devices` boolean on the host to allow the tap device to be created.
This introduces an acceptable security risk.
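+
One way to enable the boolean on worker nodes is a `MachineConfig` CR that runs `setsebool` from a systemd unit at boot, as in the following sketch.
The object name and unit name are illustrative assumptions.
+
[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-setsebool # Illustrative name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - name: setsebool.service # Illustrative unit name
        enabled: true
        contents: |
          [Unit]
          Description=Set SELinux boolean for the TAP CNI plugin
          Before=kubelet.service

          [Service]
          Type=oneshot
          ExecStart=/usr/sbin/setsebool container_use_devices=on
          RemainAfterExit=true

          [Install]
          WantedBy=multi-user.target
----
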

= Service Mesh

Telco core cloud-native functions (CNFs) typically require a service mesh implementation.
Specific service mesh features and performance requirements are dependent on the application.
The selection of service mesh implementation and configuration is outside the scope of this documentation.
You must account for the impact of service mesh on cluster resource usage and performance, including additional latency introduced in pod networking, in your implementation.

= Signaling

Signaling workloads typically use SCTP, REST, gRPC, or similar TCP or UDP protocols.
Signaling workloads support hundreds of thousands of transactions per second (TPS) by using a secondary Multus CNI configured as a MACVLAN or SR-IOV interface.
These workloads can run in pods with either guaranteed or burstable QoS.
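
The following `NetworkAttachmentDefinition` is a minimal sketch of a MACVLAN secondary network for a signaling workload.
The name, namespace, master interface, and IPAM choice are illustrative assumptions.

[source,yaml]
----
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: signaling-macvlan # Illustrative name
  namespace: example-cnf # Illustrative namespace
spec:
  config: |
    {
      "cniVersion": "0.4.0",
      "type": "macvlan",
      "master": "ens5f0",
      "mode": "bridge",
      "ipam": {
        "type": "static"
      }
    }
----

A pod attaches to this network by adding the `k8s.v1.cni.cncf.io/networks: signaling-macvlan` annotation to its metadata.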

// Module included in the following assemblies:
//
// * scalability_and_performance/telco_core_ref_design_specs/telco-core-rds.adoc

:_mod-docs-content-type: REFERENCE
[id="telco-core-software-stack_{context}"]
= Telco core reference configuration software specifications

The Red{nbsp}Hat telco core {product-version} solution has been validated using the following Red{nbsp}Hat software products for {product-title} clusters.

|Component |Software version

|{rh-rhacm-first}
|2.13^1^

|Cluster Logging Operator
|6.2^2^

|{rh-storage}
|4.19

|SR-IOV Network Operator
|4.19

|MetalLB
|4.19

|NMState Operator
|4.19

|NUMA-aware scheduler
|4.19
|====
[1] This table will be updated when the aligned {rh-rhacm} version 2.14 is released.

[2] This table will be updated when the aligned Cluster Logging Operator 6.3 is released.

= SR-IOV

New in this release::
// https://issues.redhat.com/browse/CNF-12678
* You can now create virtual functions for Mellanox NICs with the SR-IOV Network Operator when secure boot is enabled in the cluster host.
Before you can create the virtual functions, you must first skip the firmware configuration for the Mellanox NIC and manually allocate the number of virtual functions in the firmware before switching the system to secure boot.
* Support for moving a policy that fails in the resource injector to the failed state in the SR-IOV Network Operator.

Description::
SR-IOV enables physical functions (PFs) to be divided into multiple virtual functions (VFs).
VFs can then be assigned to multiple pods to achieve higher throughput performance.
The SR-IOV Network Operator provisions and manages SR-IOV CNI, network device plugin, and other components of the SR-IOV stack.

Limits and requirements::
* Only certain network interfaces are supported.
See "Supported devices" for more information.
* Enabling SR-IOV and IOMMU: the SR-IOV Network Operator automatically enables IOMMU on the kernel command line.

Engineering considerations::
* The `SriovOperatorConfig` CR must be explicitly created; a minimal sketch is shown after this list.
This CR is included in the reference configuration policies, which causes it to be created during initial deployment.
* NICs that do not support firmware updates with UEFI secure boot or kernel lockdown must be preconfigured with sufficient virtual functions (VFs) enabled to support the number of VFs required by the application workload.
For Mellanox NICs, you must disable the Mellanox vendor plugin in the SR-IOV Network Operator.
See "Configuring an SR-IOV network device" for more information.
* To change the MTU value of a VF after the pod has started, do not configure the `SriovNetworkNodePolicy` MTU field.
Instead, use the Kubernetes NMState Operator to set the MTU of the related PF.
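+
The following `SriovOperatorConfig` sketch shows the explicitly created CR referenced in the first item of this list; the field values are illustrative assumptions rather than reference values.
+
[source,yaml]
----
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 0
----
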

= Storage

Description::
+
--
Cloud native storage services can be provided by {rh-storage-first} or other third-party solutions.

{rh-storage} is a Ceph-based software-defined storage solution for containers.
It provides block storage, file system storage, and on-premise object storage, which can be dynamically provisioned for both persistent and non-persistent data requirements.
Telco core applications require persistent storage.
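
The following `PersistentVolumeClaim` is a minimal sketch of how a workload might request block storage.
The claim name, size, and the `ocs-storagecluster-ceph-rbd` storage class name are illustrative assumptions that depend on the deployed storage solution.

[source,yaml]
----
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: core-cnf-data # Illustrative name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: ocs-storagecluster-ceph-rbd # Assumes the default {rh-storage} RBD storage class
----
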
= Topology Aware Lifecycle Manager

New in this release::
* No reference design updates in this release.

Description::
{cgu-operator-full} is an Operator that runs only on the hub cluster.
{cgu-operator} manages how changes, including cluster and Operator upgrades and configurations, are rolled out to managed clusters in the network.
{cgu-operator} has the following core features:
* Provides sequenced updates of cluster configurations and upgrades ({product-title} and Operators) as defined by cluster policies.
* Allows for per-cluster actions by adding `ztp-done` or similar user-defined labels to clusters.

Limits and requirements::
* Supports concurrent cluster deployments in batches of 400.

Engineering considerations::
* Only policies with the `ran.openshift.io/ztp-deploy-wave` annotation are applied by {cgu-operator} during initial cluster installation.
* Any policy can be remediated by {cgu-operator} under the control of a user-created `ClusterGroupUpgrade` CR, as in the sketch after this list.
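+
The following `ClusterGroupUpgrade` sketch remediates one policy on one managed cluster.
The object name, namespace, cluster name, policy name, and timing values are illustrative assumptions.
+
[source,yaml]
----
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
  name: remediate-core-config # Illustrative name
  namespace: default # Illustrative namespace
spec:
  clusters:
  - core-cluster-1 # Illustrative managed cluster name
  managedPolicies:
  - core-config-policy # Illustrative policy name
  remediationStrategy:
    maxConcurrency: 1
    timeout: 240
  enable: true
----
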

Create an output directory and extract the reference configuration from the telco core RDS container image by running the following commands:

[source,terminal]
----
$ mkdir -p ./out
----

[source,terminal]
----
$ podman run -it registry.redhat.io/openshift4/openshift-telco-core-rds-rhel9:v4.19 | base64 -d | tar xv -C out
----

You can view the reference configuration in the `out/telco-core-rds/configuration/reference-crs-kube-compare` directory by running the following command:

[source,terminal]
----
$ tree -L 2
----

.Example output
[source,text]
----
out/telco-core-rds/configuration/reference-crs-kube-compare/
├── compare_ignore
├── comparison-overrides.yaml
├── metadata.yaml <1>
├── optional <2>
│   ├── logging
│   ├── networking
│   ├── other
│   └── tuning
├── ReferenceVersionCheck.yaml
├── required <3>
│   ├── networking
│   ├── other
│   ├── performance
│   ├── scheduling
│   └── storage
├── unordered_list.tmpl
└── version_match.tmpl
----
<1> Configuration file for the reference configuration.
<2> Directory for optional templates.
<3> Directory for required templates.

include::modules/telco-deviations-from-the-ref-design.adoc[leveloffset=+1]

include::modules/telco-core-common-baseline-model.adoc[leveloffset=+1]

include::modules/telco-core-deployment-planning.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources

* xref:../updating/updating_a_cluster/update-using-custom-machine-config-pools.adoc#update-using-custom-machine-config-pools[Performing a canary rollout update]

* xref:../updating/updating_a_cluster/update-using-custom-machine-config-pools.adoc#update-using-custom-machine-config-pools[Applying MachineConfigPool labels to nodes before the update]

* xref:../nodes/pods/nodes-pods-configuring.adoc#nodes-pods-pod-distruption-about_nodes-pods-configuring[Understanding how to use pod disruption budgets to specify the number of pods that must be up]

* xref:../nodes/scheduling/nodes-scheduler-pod-affinity.adoc#nodes-scheduler-pod-affinity[Placing pods relative to other pods using affinity and anti-affinity rules]

include::modules/telco-core-cluster-common-use-model-engineering-considerations.adoc[leveloffset=+1]
include::modules/telco-core-application-workloads.adoc[leveloffset=+2]

include::modules/telco-core-scheduling.adoc[leveloffset=+2]

* xref:../scalability_and_performance/cnf-numa-aware-scheduling.adoc#cnf-numa-aware-scheduling[Scheduling NUMA-aware workloads]
* xref:../scalability_and_performance/using-cpu-manager.adoc#topology_manager_policies_using-cpu-manager-and-topology_manager[Topology Manager policies]
include::modules/telco-core-node-configuration.adoc[leveloffset=+2]

include::modules/using-cluster-compare-telco-core.adoc[leveloffset=+2]

include::modules/telco-core-crs-node-configuration.adoc[leveloffset=+2]
include::modules/telco-core-crs-cluster-infrastructure.adoc[leveloffset=+2]
include::modules/telco-core-crs-resource-tuning.adoc[leveloffset=+2]
include::modules/telco-core-crs-networking.adoc[leveloffset=+2]