1
0
mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-05 12:46:18 +01:00
Files
openshift-docs/modules/policy-incident.adoc

92 lines
5.7 KiB
Plaintext

// Module included in the following assemblies:
//
// * osd_architecture/osd_policy/policy-process-security.adoc
:_mod-docs-content-type: REFERENCE
[id="policy-incident_{context}"]
= Incident and operations management
[role="_abstract"]
This documentation details the Red{nbsp}Hat responsibilities for the {product-title} managed service.
The cloud provider is responsible for protecting the hardware infrastructure that runs the services offered by the cloud provider.
The customer is responsible for incident and operations management of customer application data and any custom networking the customer has configured for the cluster network or virtual network.
[id="platform-monitoring_{context}"]
== Platform monitoring
A Red{nbsp}Hat Site Reliability Engineer (SRE) maintains a centralized monitoring and alerting system for all {product-title} cluster components, SRE services, and underlying cloud provider accounts. Platform audit logs are securely forwarded to a centralized SIEM (Security Information and Event Monitoring) system, where they might trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM for one year. Audit logs for a given cluster are not deleted at the time the cluster is deleted.
[id="incident-management_{context}"]
== Incident management
An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services.
An incident can be raised by a customer or Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
Depending on the impact on the service and customer, the incident is categorized in terms of link:https://access.redhat.com/support/offerings/production/sla[severity].
When managing a new incident, Red{nbsp}Hat uses the following general workflow:
. An SRE first responder is alerted to a new incident, and begins an initial investigation.
. After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
. The incident lead manages all communication and coordination around recovery, including any relevant notifications or support case updates.
. When the incident is resolved a brief summary of the incident and resolution are provided in the customer-initiated support ticket. This summary helps the customers understand the incident and its resolution in more detail.
If customers require more information in addition to what is provided in the support ticket, they can request the following workflow:
. The customer must make a request for the additional information within 5 business days of the incident resolution.
. Depending on the severity of the incident, Red{nbsp}Hat may provide customers with a root cause summary, or a root cause analysis (RCA) in the support ticket. The additional information will be provided within 7 business days for root cause summary and 30 business days for root cause analysis from the incident resolution.
Red{nbsp}Hat also assists with customer incidents raised through support cases.
Red{nbsp}Hat can assist with activities including but not limited to:
* Forensic gathering, including isolating virtual compute
* Guiding compute image collection
* Providing collected audit logs
[id="backup-recovery_{context}"]
== Backup and recovery
All {product-title} clusters are backed up using cloud provider snapshots. Notably, this does not include customer data stored on persistent volumes (PVs). All snapshots are taken using the appropriate cloud provider snapshot APIs and are uploaded to a secure object storage bucket (S3 in AWS, and GCS in {gcp-full}) in the same account as the cluster.
//Verify if the corresponding tables in rosa-sdpolicy-platform.adoc and rosa-policy-incident.adoc also need to be updated.
[cols= "3a,2a,2a,3a",options="header"]
|===
|Component
|Snapshot frequency
|Retention
|Notes
.2+|Full object store backup
|Daily
|7 days
.2+|This is a full backup of all Kubernetes objects like etcd. No PVs are backed up in this backup schedule.
|Weekly
|30 days
|Full object store backup
|Hourly
|24 hour
|This is a full backup of all Kubernetes objects like etcd. No PVs are backed up in this backup schedule.
|Node root volume
|Never
|N/A
|Nodes are considered to be short-term. Nothing critical should be stored on a node's root volume.
|===
* Red Hat does not commit to any Recovery Point Objective (RPO) or Recovery Time Objective (RTO).
* Customers are responsible for taking regular backups of their data
* Customers should deploy multi-AZ clusters with workloads that follow Kubernetes best practices to ensure high availability within a region.
* If an entire cloud region is unavailable, customers must install a new cluster in a different region and restore their apps using their backup data.
[id="cluster-capacity_{context}"]
== Cluster capacity
Evaluating and managing cluster capacity is a responsibility that is shared between Red Hat and the customer. Red Hat SRE is responsible for the capacity of all control plane and infrastructure nodes on the cluster.
Red Hat SRE also evaluates cluster capacity during upgrades and in response to cluster alerts. The impact of a cluster upgrade on capacity is evaluated as part of the upgrade testing process to ensure that capacity is not negatively impacted by new additions to the cluster. During a cluster upgrade, additional worker nodes are added to make sure that total cluster capacity is maintained during the upgrade process.
Capacity evaluations by SRE staff also happen in response to alerts from the cluster, once usage thresholds are exceeded for a certain period of time. Such alerts can also result in a notification to the customer.