It’s important to plan for any contingency. In this video, you’ll learn how to measure outages, options for site resiliency, and how to prepare and test for a disaster scenario.
Most organizations will create a plan that they will follow if they happen to have an outage or significant problem that could affect the overall goals of the organization. This is often referred to as the Disaster Recovery Plan, or the DRP, and it will cover every aspect and detail of how to handle these situations when they occur.
And if you think of all of the different types of technologies involved in recovering from a disaster, they are many and varied. You have backups that you need to consider. There may be offsite data replication involved in this disaster recovery. Perhaps you set up alternatives that are located in the cloud, so instead of having a server on site, we can create that same server in a cloud-based environment. Or maybe you’ve got a completely separate remote site, and you move all of your operations to that fully operating remote location.
There are also many third parties that can provide additional services for disaster recovery. For example, you can contract with a third party that will provide you with a location so that you can move your operations to this temporary facility. Or you may want to contract with recovery services that can come into your organization and manage the process of recovering from a disaster.
There are many different metrics that can help us understand the scope and the breadth of an outage. One of these metrics is a Recovery Time Objective, or an RTO. This is measured as an amount of time, and we want the RTO to be as close to zero as possible. RTO is a measurement of how quickly we can get back up and running if an outage occurs. So we need to define what a normal service level would be, and then we can calculate how long it will take to get back to that particular service level. We refer to that gap in time as the recovery time objective.
For example, we may know that if our web server were to fail, the recovery time objective for that web server becoming available again is approximately one hour. So one hour would be the RTO for that outage.
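To make that measurement concrete, here is a minimal sketch, using hypothetical timestamps for when the outage began and when the web server was back at its normal service level, that calculates the actual recovery time and compares it against a one-hour RTO.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for a web server outage
outage_started   = datetime(2024, 3, 1, 14, 5)    # web server goes down
service_restored = datetime(2024, 3, 1, 14, 50)   # back to the normal service level

rto_objective   = timedelta(hours=1)              # the one-hour RTO we defined
actual_recovery = service_restored - outage_started

print(f"Recovered in: {actual_recovery}")                  # 0:45:00
print(f"Met the RTO? {actual_recovery <= rto_objective}")  # True
```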
Another useful measurement is an RPO. This is a Recovery Point Objective. This is also measured as an amount of time. And again, we would like that RPO to be as close to zero as possible.
RPO represents how much data, measured in time, we would lose when that outage occurred. There is a window of time going back to the last point where data was stored or backed up, and when the outage occurs, any data created during that window is lost.
Generally, this is a value that you’ve already determined. You would make this determination based on what resources you have to recover this data and the types of backups that you may be doing throughout the day. And this RPO may be different depending on the type of business you’re in. For example, if your organization handles banking transactions or you manage patient information, you want to be sure that you don’t lose a lot of that data.
So you may put methods in place to be sure that you are only losing a small amount of time when an outage occurs. This might be, for example, a very short time frame that would be less than an hour.
But maybe your organization works with other types of data that don’t require such a short RPO. For example, any updates to your website or updates to internal documents may only be backed up every hour or two hours. So there would be a longer RPO associated with that type of data.
If we were to look at both of these values on a timeline, you can see that as time goes by, we reach a certain data recovery point. Maybe this is where data is backed up, or maybe we are replicating data to a different site, but we are making a copy of that data or storing it in a way that we could recover later. Sometime after that point, we have an outage, and the time between that data recovery point and the outage is the Recovery Point Objective, or the RPO.
So now we’re focused on resolving this outage. We need to resolve the issue with the servers, deploy new servers in a different location, move the data center to a backup site, or do whatever we need to do to recover from this disaster. And then finally, when our services are back online, we can measure the time between the outage and services coming back online as the Recovery Time Objective, or the RTO.
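Here’s a small sketch of that timeline, again with hypothetical timestamps; it marks the data recovery point, the outage, and the moment services come back online, and then derives the RPO and RTO from those three points.

```python
from datetime import datetime

# Three hypothetical points on the timeline
data_recovery_point = datetime(2024, 3, 1, 13, 0)   # last backup or replication point
outage              = datetime(2024, 3, 1, 14, 5)   # services go down
services_online     = datetime(2024, 3, 1, 14, 50)  # services are restored

rpo = outage - data_recovery_point   # data created in this window is lost
rto = services_online - outage       # time spent getting back to a normal service level

print(f"RPO: {rpo}")   # 1:05:00
print(f"RTO: {rto}")   # 0:45:00
```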
When these issues occur, it’s useful to know how long it will take to resolve the problem, and it might also be good to be able to predict when problems might occur. We can provide estimates for both of those values by using MTTR and MTBF. MTTR is the Mean Time to Repair. That is, on average, how long it takes to resolve the issue associated with that particular problem. Maybe it’s a router that’s failed, and we need to replace that router to get back up and running. The average time frame for replacing that router would be our mean time to repair.
A better plan might be to use systems that are designed to stay up and running as long as possible. So we would need to put in equipment that would have a very long Mean Time Between Failures, or MTBF. The MTBF is generally based on a number of criteria, but it’s presented as a single time frame. For example, if you purchase a firewall, that firewall’s MTBF may be 20 years before you would expect that device to fail.
So you can then make disaster-recovery plans around that time frame. If you see that your firewall has an MTBF of 20 years, you might only need one additional backup unit instead of purchasing multiples because you know that device is going to last, on average, a relatively long period of time.
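As a rough sketch of how these averages could be calculated, assuming a hypothetical incident history for a single device, MTTR is the total repair time divided by the number of repairs, and MTBF is the total operating time divided by the number of failures.

```python
# Hypothetical incident history for a single device
repair_hours    = [2.0, 3.5, 1.5]    # hours spent repairing each of three failures
operating_hours = 26_280             # total running time between failures (~3 years)

mttr = sum(repair_hours) / len(repair_hours)   # Mean Time to Repair
mtbf = operating_hours / len(repair_hours)     # Mean Time Between Failures

print(f"MTTR: {mttr:.1f} hours")                             # 2.3 hours
print(f"MTBF: {mtbf:.0f} hours (~{mtbf / 8760:.1f} years)")  # 8760 hours (~1.0 years)
```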
If we’re dealing with a significant disaster, we may need to move out of our existing data center and into a temporary facility, but making that change is often not a simple process. There are a lot of different moving parts and things that have to happen to move your entire data center from one location to another. And once you’re at that other location, you eventually have to move everything back when the original site is up and running again. We refer to this ability to move from one location to another, and back again, as site resiliency.
One example of this site resiliency may be the process you go through to prepare that disaster-recovery site. You need to make sure you have power. You may need to bring in additional hardware and have it staged prior to a disaster, and you may want to have data that’s transferred over.
Once that disaster occurs, you will have a process where you move from your primary location to this backup facility. You would then work from this backup facility until the problem is resolved at the main location. This might take an hour, it might take a day, or it could take months, depending on the problem that’s occurred.
Every disaster will be different, and we have to think about that time frame when we’re preparing the alternate site. And, of course, when we are ready to move back to the original data center, there is another process where we would take all of our assets and our data and move it back to the original location.
If you’re going to be using a separate disaster-recovery site, there are a number of different ways to set up this particular facility. One is with a cold site. This is effectively an empty building. None of our equipment is in this building, and none of our data currently resides at this location. We need to grab backup tapes or equipment that has our data and move it to this location to be able to perform disaster recovery.
We also don’t have any people at this location, so we may need to transport people from one location to another so that they can work at this physical site. This obviously means we have a lot of work to do if we call a disaster, but it also means that this is a relatively inexpensive place to use as a backup location.
If you needed to have a disaster site where you could very easily move in and be up and running, you may want to consider a hot site. A hot site is an exact replica of your data center, or as exact as you can make it for something that’s a disaster-recovery location. This means it has the same hardware that you are currently using in your existing data center, and it’s very common that when you’re purchasing new equipment for a data center that you also purchase additional units for your hot site.
And, of course, not only do we need equipment at this hot site, we also need our applications and our data. It’s very common to have replication systems or ongoing backups so that the data at your hot site is as close as possible to the data at your primary location. And by putting all of this in place at the hot site, we can then move relatively quickly from our primary data center to this disaster-recovery location. And since we don’t have to install any hardware, install any applications, or recover data from backups, we can be up and running relatively quickly.
A warm site would be somewhere in the middle between a cold site and a hot site. This is a site that might have some level of infrastructure, such as power and racks and, in some cases, even additional hardware that you could use, so all you would need to do is show up with your data, recover from your backup tapes, and you would be up and running. There are different levels of service available for a warm site, so you can decide just how much hardware or how much data you would like to have staged at that location.
It’s always a good idea to practice and run through tests so you know exactly what to do should a disaster occur, but that process of going through a full-blown disaster-recovery test can be relatively costly. This will take people out of their normal jobs and, perhaps in some cases, send them to a separate location to be able to perform the actual disaster-recovery test.
Instead of going through a full-scale test, it might be useful to have everyone sit around a conference table and go through the process that they would follow if a disaster was called. This would allow all of the different departments in IT and all of the management of the company to step through simulated problems and describe what they would do in those particular situations.
This means we don’t have to go through the physical process of going to get our backups and taking them to this remote location, but it does require that everyone step through the process and see if all the logistics are in place to get that data from one location to another.
Since we’re all sitting around a conference room table to be able to describe the process that we would follow, we refer to this as a tabletop exercise. This is a meeting that you can go through in about a day or two, and all the key players will be participating and stepping through the process that they will follow if a disaster is called.
There are some organizations, though, that go through a full-blown disaster-recovery test either once a year or multiple times a year, so it’s always good to go through these validation tests. That way, if a disaster occurs, everyone in the organization will know exactly what to do. When you’re running through these validation tests, you’re obviously not actually moving from a production environment into your disaster environment, but you are going through exactly the same process.
Usually these validation tests will follow a particular scenario. For example, let’s say a fire was to destroy the building where your primary data center existed. There would be a series of processes and steps to be able to move everything to the disaster-recovery site.
But what if the scenario was different where everyone in your geographic area around that data center had to be evacuated? In that situation, there might be a different set of steps to be able to get all of your applications and all of your data from your primary location to that backup site.
Once everyone steps through this scenario and goes through the entire disaster-recovery test, we can document what worked and the things that need to be fixed for later. This allows us to make ongoing improvements so that we know exactly how to be more efficient when moving everyone to a disaster-recovery site.