mirror of https://github.com/openshift/openshift-docs.git synced 2026-02-05 12:46:18 +01:00

Merge pull request #90947 from openshift-cherrypick-robot/cherry-pick-90613-to-enterprise-4.19

[enterprise-4.19] [OSDOCS#13195]:Updated incident management section of RACI doc with more info.
Servesha Dudhgaonkar
2025-03-24 18:55:29 +05:30
committed by GitHub
2 changed files with 12 additions and 3 deletions

@@ -102,20 +102,24 @@ Platform audit logs are securely forwarded to a centralized security information
[id="rosa-policy-incident-management_{context}"]
== Incident management
An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services. An incident can be raised by a customer or a Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services, and can affect service-level agreements (SLAs).
Customers and Customer Experience and Engagement (CEE) members can raise an incident through a support case. The centralized monitoring and alerting system and members of the SRE team can also raise an incident directly.
Depending on the impact on the service and customer, the incident is categorized in terms of link:https://access.redhat.com/support/offerings/production/sla[severity].
Red{nbsp}Hat either sends out cluster notifications to affected individual clusters or changes the status at link:https://status.redhat.com[status.redhat.com] to reflect a wider incident. Cluster notifications are not sent for low-impact events, low-risk security updates, routine operations and maintenance, or minor, transient issues that are quickly resolved by SRE.
When managing a new incident, Red{nbsp}Hat uses the following general workflow:
. An SRE first responder is alerted to a new incident and begins an initial investigation.
. After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
. An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates.
. An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates. If the status of a service changes or if Red{nbsp}Hat has a significant update on the progress, then the incident lead sends out an updated cluster notification.
. The incident is recovered.
. The incident is documented and a root cause analysis (RCA) is performed within 5 business days of the incident.
. An RCA draft document will be shared with the customer within 7 business days of the incident.
Red{nbsp}Hat also assists with customer incidents raised through support cases.
Red{nbsp}Hat can assist with activities including but not limited to:
* Forensic gathering, including isolating virtual compute

@@ -33,6 +33,11 @@ include::modules/managed-cluster-notification-policy.adoc[leveloffset=+2]
//---
include::modules/rosa-policy-incident.adoc[leveloffset=+1]
[role="_additional-resources"]
.Additional resources
* xref:../../rosa_cluster_admin/rosa-cluster-notifications.adoc#rosa-cluster-notifications[Cluster notifications]
include::modules/rosa-policy-change-management.adoc[leveloffset=+1]
[role="_additional-resources"]