Disaster Recovery (DR) strategies come in various flavors, each serving a specific purpose and budgetary consideration. The decision to implement a cold, hot, or warm DR strategy depends on the criticality of the systems, recovery time objectives (RTO), and the available resources.
- Hot DR keeps a fully operational duplicate of the primary system, running at all times. This strategy offers the quickest recovery but is also the most expensive. It's suitable for mission-critical applications where even minimal downtime is unacceptable.
- Cold DR involves maintaining backups and infrastructure at a remote location, but it's not continuously powered on. It's a cost-effective choice for businesses with less urgent RTO requirements.
- Warm DR falls in between, with some systems pre-configured and ready to be powered up quickly. It balances cost and recovery time and is suitable for businesses with moderate RTO needs.
The choice of strategy is a careful balance between the criticality of data, budget constraints, and how quickly a business needs to resume operations after a disaster.
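As a rough illustration of how RTO drives the choice, the sketch below maps a required RTO to one of the three strategies. The function name and both cut-off parameters are hypothetical. The article does not define any thresholds, so each organization would supply its own, and criticality and budget would still need to be weighed separately.

```python
from datetime import timedelta
from enum import Enum


class Strategy(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


def suggest_strategy(required_rto: timedelta,
                     hot_below: timedelta,
                     warm_below: timedelta) -> Strategy:
    """Suggest a DR strategy from the required RTO alone.

    hot_below and warm_below are organization-specific cut-offs
    (hypothetical parameters, not values taken from the article).
    """
    if required_rto < hot_below:
        return Strategy.HOT
    if required_rto < warm_below:
        return Strategy.WARM
    return Strategy.COLD


# Illustrative cut-offs only: hot under 15 minutes, warm under 4 hours.
choice = suggest_strategy(timedelta(hours=6),
                          hot_below=timedelta(minutes=15),
                          warm_below=timedelta(hours=4))
print(choice.value)  # cold
```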
In this article, we will look at the difference between Warm and Cold DR. Let's start by defining some common terms:
- Disaster: A sudden, catastrophic event causing significant disruption or distress, impacting normal functioning and posing a threat to an organization's operations, assets, or people. Disasters can be natural (e.g., earthquakes) or human-made (e.g., cyber-attacks).
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time, indicating the point in time to which systems and data must be recovered after a disaster.
- RTO (Recovery Time Objective): The maximum allowable downtime for a system or process after a disaster, defining the target time within which systems, applications, or services must be restored to normal operational status.
- WRT (Work Recovery Time): The time needed to fully recover and resume normal business operations after a disaster, encompassing both technical recovery and the time required for employees and processes to return to regular workflow.
- MTD (Maximum Tolerable Downtime): The maximum duration of time a business process or system can be offline before significant harm is done to the organization's objectives, reputation, or financial stability. MTD helps set RTO and RPO values.
- Fail-over: The switching of operations, often automatically, from a failed system to a backup or secondary system to keep services available and minimize downtime.
- Fail-back: The process of returning operations from a backup or secondary system back to the primary system after restoration, ensuring a smooth transition and maintaining data consistency.
- Business Continuity (BC): The organization's ability to maintain essential functions and operations during and after a disaster or disruptive event, involving strategies like disaster recovery, risk management, and crisis management to ensure long-term viability.
- Unplanned Downtime: Unexpected periods during which systems, services, or processes are unavailable due to unforeseen events, leading to disruptions in normal business operations.
- Planned Downtime: Scheduled periods during which systems or services are intentionally made unavailable for maintenance, upgrades, or other planned activities, allowing organizations to perform necessary tasks without affecting critical operations.
Our Cloud Operations and Support Team has provided a simple diagram that puts all of these terms into context.
In this diagram, we can see the stages of a Cold DR declaration:
Before Disaster Strikes:
Business runs as usual, with backup services carrying out data backups according to the configured schedule. Considerations when designing a backup schedule include the size of the data to be backed up and the tolerable data loss. For example, if a backup task is scheduled to run once a day, the maximum data loss an organization can suffer, and therefore the RPO the schedule can support, is up to 24 hours.
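The relationship between backup frequency and data loss can be sketched in a few lines. This is a simplified model that assumes the worst-case loss equals the interval between backups and ignores how long each backup takes to complete. The function names and the 4-hour target are illustrative only.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Data written right after a backup stays unprotected until the next run."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo_target: timedelta) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO target."""
    return worst_case_data_loss(backup_interval) <= rpo_target


daily = timedelta(hours=24)
print(meets_rpo(daily, timedelta(hours=24)))  # True
print(meets_rpo(daily, timedelta(hours=4)))   # False
```

A schedule that fails this check needs either more frequent backups or an additional mechanism such as replication.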
During Disaster:
As soon as a disaster occurs, the focus shifts to ensuring business continuity by bringing the DR site up. There is usually a preset DR “Playbook” that can be referred to whenever a disaster happens. In a cold DR scenario, the relevant stakeholders need to restore backups onto the virtual machines that are set up in the DR location. The duration from the start of the disaster until the backup restoration has been completed and verified reflects the RTO of the organization. As an example, if it takes 5 hours to restore a 2 TB disk, 30 minutes to verify the recovered copy, and a further 2 hours to arrange for the necessary stakeholders with access rights to access the DR site, the RTO of this operation is 5 hours 30 minutes, whereas the Maximum Tolerable Downtime has to cover the full 7 hours 30 minutes.
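The arithmetic in this example can be checked with a short sketch. The durations come from the example above. The variable names and the `fmt` helper are illustrative, and adding the stakeholder time on top of the RTO simply follows the example's own totals.

```python
from datetime import timedelta

# Durations taken from the cold DR example.
restore = timedelta(hours=5)             # restore a 2 TB disk
verify = timedelta(minutes=30)           # verify the recovered copy
stakeholder_access = timedelta(hours=2)  # arrange access to the DR site

rto = restore + verify
total_downtime = rto + stakeholder_access


def fmt(duration: timedelta) -> str:
    """Format a duration as hours and minutes."""
    total_minutes = int(duration.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


print("RTO:", fmt(rto))                        # RTO: 5h 30m
print("Total downtime:", fmt(total_downtime))  # Total downtime: 7h 30m
```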
Business Resumes at DR:
While the production site is being recovered, business goes on as usual on the DR site, and new data is now stored within the DR site. Operations will not run at the DR site permanently and will have to fail back to the production site once it is restored. Backups are still carried out according to the required schedule.
Production Site Recovers:
Upon confirmation that the production site is ready for service, a planned maintenance period has to be scheduled during which all activities are expected to stop and a final backup is taken on the DR site. Once the backup is completed, it is transferred back to the production site and restored there. The restored machine is then verified, and business resumes at the production site.
On the other hand, for a Warm DR declaration:
Before Disaster Strikes:
Business runs as usual. On top of regular backup services, additional mechanisms need to be in place to achieve a shorter RTO. An example of such a mechanism is additional replication.
During Disaster:
Fail-over recovery is carried out using the additional tools that are in place, and the recovered machines at the DR site then have to be verified.
Business Resumes at DR:
While the production site is being recovered, business goes on as usual on the DR site, and new data is now stored within the DR site. Operations will not run at the DR site permanently and will have to fail back to the production site once it is restored. Regular checksum checks are carried out to ensure data integrity.
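A checksum check of this kind could look like the sketch below, which compares SHA-256 digests of a source file and its replica. The article does not say which tool or algorithm is used, so the file-level comparison, the choice of SHA-256, and the function names are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source: Path, replica: Path) -> bool:
    """True if both files have identical content according to SHA-256."""
    return sha256_of(source) == sha256_of(replica)


if __name__ == "__main__":
    # Small self-contained demo using temporary files (hypothetical data).
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source.bin"
        dst = Path(tmp) / "replica.bin"
        src.write_bytes(b"example payload")
        dst.write_bytes(b"example payload")
        print(copies_match(src, dst))  # True
```

Replication products often perform their own integrity verification, so a manual check like this would be a supplement rather than a replacement.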
Production Site Recovers:
Upon confirmation that the production site is ready for service, a planned maintenance period has to be scheduled during which all activities are expected to stop and a reverse replication from the DR site to the production site is carried out to bring across the latest data changes. Once this is completed, the restored machine is verified and business resumes at the production site.
Every organization has different Business Continuity Plans and objectives. With that in mind, the right DR strategy has to be selected, as different strategies require different tools.