Disaster Recovery - SaaS

RunMyJobs supports cross-region disaster recovery (DR) by default.

To understand how cross-region DR works with RunMyJobs, you need a basic understanding of the RunMyJobs SaaS architecture. You also need to understand two terms: RPO (Recovery Point Objective), the maximum amount of data loss that can be tolerated, and RTO (Recovery Time Objective), the target amount of time it takes to recover from a disruptive event.

RunMyJobs SaaS Architecture

A RunMyJobs instance is tied to a particular AWS region. Within that region, it runs in a containerized, clustered environment.

Failover Within an AWS Region

An AWS region can include multiple Availability Zones (AZs), and RunMyJobs typically uses three AZs per region to ensure high availability within that region. If the AZ in which RunMyJobs is running goes down, the instance is automatically switched over to a different AZ, with an RPO of zero and a minimal RTO (due to the time it takes to spin up RunMyJobs in the new AZ).

Cross-Regional Failover

Every AWS region in which RunMyJobs runs has a designated secondary (failover) region. The secondary region is determined by Redwood and cannot be changed.

Hosting         Primary Region    Secondary Region
European        Dublin            Paris
European        Paris             Dublin
USA/Americas    Oregon            Ohio
USA/Americas    Ohio              Oregon
Germany         Frankfurt         Zurich
Germany         Zurich            Frankfurt
Asia Pacific    Sydney            Melbourne
Asia Pacific    Melbourne         Sydney
Asia Pacific    Singapore         Sydney
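
For planning purposes (for example, checking which region's IP ranges or network routes your agents may need to reach after a failover), the pairings in the table can be captured as a simple lookup. The following Python sketch is illustrative only: the names FAILOVER_REGION and secondary_region are hypothetical and are not part of any Redwood API. Note that the mapping is not always symmetric: Singapore fails over to Sydney, but Sydney fails over to Melbourne.

```python
# Primary -> secondary region pairs, copied from the table in this topic.
# FAILOVER_REGION and secondary_region are hypothetical names used for
# illustration; they are not part of any Redwood or AWS API.
FAILOVER_REGION = {
    "Dublin": "Paris",
    "Paris": "Dublin",
    "Oregon": "Ohio",
    "Ohio": "Oregon",
    "Frankfurt": "Zurich",
    "Zurich": "Frankfurt",
    "Sydney": "Melbourne",
    "Melbourne": "Sydney",
    "Singapore": "Sydney",
}


def secondary_region(primary: str) -> str:
    """Return the designated failover region for a primary region."""
    try:
        return FAILOVER_REGION[primary]
    except KeyError:
        raise ValueError(f"No failover region listed for {primary!r}") from None
```

For example, secondary_region("Oregon") returns "Ohio", and an unlisted region name raises a ValueError rather than guessing.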

If a disaster (a region-wide, sustained AWS outage with no ETA or a long ETA) occurs, all environments are brought up in the designated secondary AWS region. Because each primary region has a dedicated secondary region, the database and files are synchronized to it automatically, which minimizes the loss of data and job processing.

A disaster must be declared by the Redwood Incident Response Team (IRT). If a region failure occurs, contact Redwood Support. They will perform a root cause analysis. Your site may remain down during this period.

Once the issue is located and verified, Redwood provides the following RPO and RTO values.

  • Standard tier: RPO = 8 hours, RTO = 16 hours

  • Professional tier: RPO = 4 hours, RTO = 8 hours

  • Enterprise tier: RPO = 15 minutes, RTO = 2 hours

The RTO timer starts when Redwood declares a disaster for the primary production region. The RPO timer starts at the last successful replication to the failover instance.
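
The arithmetic behind these timers can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a Redwood tool: the names TIER_OBJECTIVES and dr_window are hypothetical, the tier values are copied from the list above, and the data-loss window is assumed to run from the last successful replication to the start of the outage.

```python
from datetime import datetime, timedelta, timezone

# (RPO, RTO) per tier, copied from the list in this topic.
# TIER_OBJECTIVES and dr_window are hypothetical names for illustration.
TIER_OBJECTIVES = {
    "standard": (timedelta(hours=8), timedelta(hours=16)),
    "professional": (timedelta(hours=4), timedelta(hours=8)),
    "enterprise": (timedelta(minutes=15), timedelta(hours=2)),
}


def dr_window(tier, declared_at, last_replication, outage_at):
    """Compute the recovery target and data-loss window for one incident.

    declared_at:      when Redwood declared the disaster (starts the RTO timer)
    last_replication: last successful replication to the failover instance
    outage_at:        when the primary region failed (assumed end of the
                      data-loss window; this interpretation is an assumption)
    """
    rpo, rto = TIER_OBJECTIVES[tier.lower()]
    data_loss = outage_at - last_replication
    return {
        "recovery_target": declared_at + rto,
        "data_loss_window": data_loss,
        "within_rpo": data_loss <= rpo,
    }


declared = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
result = dr_window(
    "Enterprise",
    declared_at=declared,
    last_replication=declared - timedelta(minutes=40),
    outage_at=declared - timedelta(minutes=30),
)
print(result["recovery_target"].isoformat())  # 2024-01-01T14:00:00+00:00
print(result["within_rpo"])                   # True (10 minutes <= 15 minutes)
```

Because the RTO timer starts at the declaration and not at the start of the outage, the time between the outage and the declaration is not counted against the RTO in this sketch.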

When the primary region recovers, production can be switched back (with a scheduled downtime that will be communicated by Redwood), unless the primary region is indefinitely unavailable.

Note: The cross-region DR feature is designed to address issues with entire regions, not with individual accounts and environments.

Note: Customers cannot test cross-region disaster recovery themselves. If testing is required, Redwood can provide an internal disaster recovery test report to customers that are under an NDA.