// Module included in the following assemblies:
//
// * osd_architecture/osd_policy/policy-process-security.adoc

[id="policy-incident_{context}"]
= Incident and operations management

This documentation details the Red Hat responsibilities for the {product-title} managed service.

[id="platform-monitoring_{context}"]
== Platform monitoring

A Red Hat Site Reliability Engineer (SRE) maintains a centralized monitoring and alerting system for all {product-title} cluster components, SRE services, and underlying cloud provider accounts. Platform audit logs are securely forwarded to a centralized Security Information and Event Management (SIEM) system, where they might trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM for one year. Audit logs for a given cluster are not deleted when the cluster itself is deleted.
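
Red Hat manages this monitoring and forwarding pipeline internally, and this document does not describe its implementation. Purely as an illustration of the pattern, the following sketch shows how audit-log forwarding of this general shape can be expressed as an OpenShift `ClusterLogForwarder` resource; the output name and SIEM endpoint are hypothetical.

[source,yaml]
----
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
  - name: siem                          # hypothetical output name
    type: syslog
    url: tls://siem.example.com:6514    # hypothetical SIEM endpoint
  pipelines:
  - name: audit-to-siem
    inputRefs:
    - audit                             # forward only the audit log stream
    outputRefs:
    - siem
----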

[id="incident-management_{context}"]
== Incident management

An incident is an event that results in a degradation or outage of one or more Red Hat services. An incident can be raised by a customer or a Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.

Depending on the impact on the service and customer, the incident is categorized in terms of link:https://access.redhat.com/support/offerings/production/sla[severity].

The general workflow for how Red Hat manages a new incident is as follows:

. An SRE first responder is alerted to a new incident and begins an initial investigation.
. After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
. The incident lead manages all communication and coordination around recovery, including any relevant notifications or support case updates.
. The incident is recovered.
. The incident is documented and a root cause analysis (RCA) is performed within 5 business days of the incident.
. An RCA draft document is shared with the customer within 7 business days of the incident.

[id="notifications_{context}"]
== Notifications

Platform notifications are configured using email. Any customer notification is also sent to the corresponding Red Hat account team and, if applicable, to the Red Hat Technical Account Manager.

The following activities can trigger notifications:

* Platform incident
* Performance degradation
* Cluster capacity warnings
* Critical vulnerabilities and resolution
* Upgrade scheduling

[id="backup-recovery_{context}"]
== Backup and recovery

All {product-title} clusters are backed up using cloud provider snapshots. Notably, this does not include customer data stored on persistent volumes (PVs). All snapshots are taken using the appropriate cloud provider snapshot APIs and are uploaded to a secure object storage bucket (S3 in AWS and GCS in Google Cloud) in the same account as the cluster.

//Verify if the corresponding tables in rosa-sdpolicy-platform.adoc and rosa-policy-incident.adoc also need to be updated.

[cols="3a,2a,2a,3a",options="header"]
|===
|Component
|Snapshot frequency
|Retention
|Notes

.2+|Full object store backup
|Daily
|7 days
.2+|This is a full backup of all Kubernetes objects, such as etcd. No PVs are backed up in this backup schedule.

|Weekly
|30 days

|Full object store backup
|Hourly
|24 hours
|This is a full backup of all Kubernetes objects, such as etcd. No PVs are backed up in this backup schedule.

|Node root volume
|Never
|N/A
|Nodes are considered to be short-term. Nothing critical should be stored on a node's root volume.
|===
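
The document does not name the tooling behind these object store backups. As an illustration only, the following sketch shows how the daily row in the table above could be expressed, assuming a Velero-style `Schedule` resource; Velero itself, the resource name, and the namespace are assumptions, not a description of how Red Hat implements the backups.

[source,yaml]
----
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: example-daily-object-backup    # hypothetical name
  namespace: velero                    # hypothetical namespace
spec:
  schedule: "0 1 * * *"                # daily, matching the "Daily" row above
  template:
    ttl: 168h0m0s                      # 7-day retention, matching the table
    snapshotVolumes: false             # PVs are not part of this backup schedule
----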

* Red Hat does not commit to any Recovery Point Objective (RPO) or Recovery Time Objective (RTO).
* Customers are responsible for taking regular backups of their data.
* Customers should deploy multi-AZ clusters with workloads that follow Kubernetes best practices to ensure high availability within a region.
* If an entire cloud region is unavailable, customers must install a new cluster in a different region and restore their applications by using their backup data.

[id="cluster-capacity_{context}"]
== Cluster capacity

Evaluating and managing cluster capacity is a responsibility that is shared between Red Hat and the customer. Red Hat SRE is responsible for the capacity of all control plane and infrastructure nodes on the cluster.

Red Hat SRE also evaluates cluster capacity during upgrades and in response to cluster alerts. The impact of a cluster upgrade on capacity is evaluated as part of the upgrade testing process to ensure that capacity is not negatively impacted by new additions to the cluster. During a cluster upgrade, additional worker nodes are added to make sure that total cluster capacity is maintained during the upgrade process.

Capacity evaluations by SRE staff also happen in response to alerts from the cluster, which fire after usage thresholds are exceeded for a certain period of time. Such alerts can also result in a notification to the customer.
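
As an illustration of this threshold-for-a-period pattern, the following sketch shows how such an alert can be written as a prometheus-operator `PrometheusRule`; the metrics expression, threshold, and 30-minute window are hypothetical, not the values Red Hat uses.

[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-capacity-alerts        # hypothetical name
  namespace: openshift-monitoring
spec:
  groups:
  - name: example-capacity.rules
    rules:
    - alert: ExampleWorkerCPUSaturated # hypothetical alert, not an actual SRE rule
      # Requested CPU across nodes stays above 90% of allocatable CPU
      # for 30 minutes before the alert fires.
      expr: |
        sum(kube_pod_container_resource_requests{resource="cpu"})
          / sum(kube_node_status_allocatable{resource="cpu"}) > 0.9
      for: 30m
      labels:
        severity: warning
----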