mirror of https://github.com/openshift/openshift-docs.git (synced 2026-02-05 12:46:18 +01:00)
Merge pull request #90947 from openshift-cherrypick-robot/cherry-pick-90613-to-enterprise-4.19
[enterprise-4.19] [OSDOCS#13195]:Updated incident management section of RACI doc with more info.
@@ -102,20 +102,24 @@ Platform audit logs are securely forwarded to a centralized security information
[id="rosa-policy-incident-management_{context}"]
== Incident management
An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services. An incident can be raised by a customer or a Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
An incident is an event that results in a degradation or outage of one or more Red{nbsp}Hat services, and can affect service-level agreements (SLAs).

Customers and Customer Experience and Engagement (CEE) members can raise an incident through a support case. The centralized monitoring and alerting system and members of the SRE team can also raise an incident directly.

Depending on the impact on the service and customer, the incident is categorized in terms of link:https://access.redhat.com/support/offerings/production/sla[severity].

Red{nbsp}Hat either sends out cluster notifications to affected individual clusters or changes the status at link:https://status.redhat.com[status.redhat.com] to reflect a wider incident. Cluster notifications are not sent for low-impact events, low-risk security updates, routine operations and maintenance, or minor, transient issues that are quickly resolved by SRE.

When managing a new incident, Red{nbsp}Hat uses the following general workflow:

. An SRE first responder is alerted to a new incident and begins an initial investigation.
. After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
. An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates.
. An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates. If the status of a service changes or if Red{nbsp}Hat has a significant update on the progress, then the incident lead sends out an updated cluster notification.
. The incident is recovered.
. The incident is documented and a root cause analysis (RCA) is performed within 5 business days of the incident.
. An RCA draft document will be shared with the customer within 7 business days of the incident.

Red{nbsp}Hat also assists with customer incidents raised through support cases.
Red{nbsp}Hat also assists with customer incidents raised through support cases.
Red{nbsp}Hat can assist with activities including but not limited to:

* Forensic gathering, including isolating virtual compute
@@ -33,6 +33,11 @@ include::modules/managed-cluster-notification-policy.adoc[leveloffset=+2]
//---

include::modules/rosa-policy-incident.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
* xref:../../rosa_cluster_admin/rosa-cluster-notifications.adoc#rosa-cluster-notifications[Cluster notifications]

include::modules/rosa-policy-change-management.adoc[leveloffset=+1]

[role="_additional-resources"]