// Module included in the following assemblies:
//
// * rosa_architecture/rosa_policy_service_definition/rosa-policy-shared-responsibility.adoc

:_mod-docs-content-type: CONCEPT
[id="rosa-policy-disaster-recovery_{context}"]
= Disaster recovery
Disaster recovery includes data and configuration backup, replicating data and configuration to the disaster recovery environment, and failover on disaster events.

{product-title} (ROSA) provides disaster recovery for failures that occur at the pod, node, and availability zone levels.

All disaster recovery requires that the customer use best practices for deploying highly available applications, storage, and cluster architecture, such as multiple machine pools across multiple availability zones, to account for the level of desired availability.

One cluster with a single machine pool will not provide disaster avoidance or recovery in the event of an availability zone or region outage. Multiple clusters with single machine pools with customer-maintained failover can account for outages at the zone or at the regional level.

One cluster with multiple machine pools across multiple availability zones will not provide disaster avoidance or recovery in the event of a full region outage. Multiple clusters in several regions with multiple machine pools in more than one availability-zone with customer-maintained failover can account for outages at the regional level.

[cols="2a,3a,3a" ,options="header"]
|===
|Resource
|Service responsibilities
|Customer responsibilities

|Virtual networking management
|**Red{nbsp}Hat**

- Recreate affected virtual network components that are necessary for the platform to function.
|- Configure virtual networking connections with more than one tunnel where possible for protection against outages as recommended by the public cloud provider.
- Maintain failover DNS and load balancing if using a global load balancer with multiple clusters.
//waiting to hear back from Will G. if RH responsibilities below should have just been conditionalized out per the migration review for hcp or Classic as well.
|Virtual Storage management
|**Red{nbsp}Hat**
ifdef::openshift-rosa[]
- For ROSA clusters created with IAM user credentials, back up all Kubernetes objects on the cluster through hourly, daily, and weekly volume snapshots. Hourly backups are retained for 24 hrs (1 day), daily backups are retained for 168 hrs (1 week), and weekly backups are retained for 720 hrs (30 days).
endif::openshift-rosa[]

|- Back up customer applications and application data.

|Virtual compute management
|**Red{nbsp}Hat**
ifdef::openshift-rosa[]
- Monitor the cluster and replace failed Amazon EC2 control plane or infrastructure nodes.
endif::openshift-rosa[]
- Provide the ability for the customer to manually or automatically replace failed worker nodes.

|- Replace failed Amazon EC2 worker nodes by editing the
machine pool configuration through OpenShift Cluster Manager or the ROSA CLI.

|AWS software (public AWS services)
|**AWS**

**Compute:** Provide Amazon EC2 features that support data resiliency such as Amazon EBS snapshots and Amazon EC2 Auto Scaling. For more information, see link:https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/disaster-recovery-resiliency.html[Resilience in Amazon EC2] in the EC2 User Guide.

**Storage:** Provide the ability for the ROSA service
and customers to back up the Amazon EBS volume on the cluster through Amazon EBS volume snapshots.

**Storage:** For information about Amazon S3 features that support data resiliency, see link:https://docs.aws.amazon.com/AmazonS3/latest/userguide/disaster-recovery-resiliency.html[Resilience in Amazon S3].

**Networking:** For information about Amazon VPC features that support data resiliency, see link:https://docs.aws.amazon.com/vpc/latest/userguide/disaster-recovery-resiliency.html[Resilience in Amazon Virtual Private
Cloud] in the Amazon VPC User Guide.

|- Configure ROSA clusters with multiple machine pools across multiple availability zones to improve fault tolerance and cluster availability.

- Provision persistent
volumes using the
Amazon EBS CSI
driver to enable
volume snapshots.

- Create CSI volume snapshots of Amazon
EBS persistent volumes.
|Hardware/AWS global infrastructure
|**AWS**

- Provide AWS global infrastructure that allows ROSA to scale
ifdef::openshift-rosa[]
control plane, infrastructure, and worker nodes
endif::openshift-rosa[]
ifdef::openshift-rosa-hcp[]
nodes
endif::openshift-rosa-hcp[]
across
Availability Zones. This functionality enables ROSA to orchestrate automatic failover between zones without interruption.

- For more information about disaster recovery best practices, see link:https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-options-in-the-cloud.html[Disaster recovery options in the cloud] in the AWS
Well-Architected Framework.

|- Configure ROSA clusters with multiple machine pools across multiple availability zones to improve fault tolerance and cluster availability.

|===