There are many considerations when working on the recovery of business resources. In this video, you’ll learn about these important recovery thresholds and how to calculate the amount of uptime and availability.
Unfortunately, problems occur. This might be a hardware failure, you might have data corruption inside of some software, or there might be an attack. And you will be expected to recover from these particular issues. One of the things you may be asked to determine is how long it’s going to take to restore us back to where we were. This would be the mean time to restore, or MTTR. You might also hear this referred to as the mean time to repair. Sometimes that number’s very easy to determine. Sometimes there are a lot more variables involved, so you have to make a broader estimate of what your MTTR might be.
There’s also a mean time to failure, or MTTF. When things are running OK, you might want to consider how long it’s going to be before something fails. It’s very common to do this with things like network infrastructure equipment. We know that hardware will eventually fail. So we want to determine how long we can expect this particular piece of hardware to run without a problem. And of course, we might have some secondary pieces of equipment or run pieces of equipment in tandem, so that if one fails, the other is able to take over and we recover very quickly. That expected running time is the mean time to failure. That’s how long we can estimate that a particular device is going to run before it has a problem.
Of course, you don’t just have one device in your environment. You have many different kinds of devices. And so you want to get an idea of how long it’s going to be between one failure and the next. This would be our mean time between failures, or MTBF. This is obviously a prediction, but you can consider all of the different equipment you might have and how long each piece is expected to run based on its mean time to failure, and then estimate how long you can expect there to be between individual failures.
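As a rough illustration of how these numbers might be estimated from past incidents, here is a minimal Python sketch. The outage log is made up, the function names are hypothetical, and measuring MTBF from the start of one failure to the start of the next is only one common convention. Treat it as a sketch under those assumptions, not a standard method.

```python
from statistics import mean

# Hypothetical outage log: (hour the failure began, hour service was restored),
# measured in hours since the equipment went into service.
outages = [(100, 102), (400, 401), (1000, 1003)]

def mean_time_to_restore(log):
    """Average length of an outage, in hours (MTTR)."""
    return mean([restored - failed for failed, restored in log])

def mean_time_between_failures(log):
    """Average gap, in hours, from the start of one failure to the start of the next (MTBF).

    Needs at least two failures in the log; otherwise mean() raises StatisticsError.
    """
    starts = [failed for failed, _ in log]
    return mean([later - earlier for earlier, later in zip(starts, starts[1:])])

print("MTTR (hours):", mean_time_to_restore(outages))
print("MTBF (hours):", mean_time_between_failures(outages))
```

With only three outages in the log, these averages are very rough. A real estimate would draw on far more history or on the manufacturer’s published figures.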
Another consideration is the recovery time objective, or RTO. You have to make a decision on how quickly you’re able to recover to a certain service level. You may be able to recover and get people back up and running, but their data may not be available. Or you may decide to get all the way back up and running, and have the data available, but the time frame for each of those may be very different. So you have to calculate and determine how long it’s going to take to get back to every single recovery service level, and that time is the RTO for that level.
When doing your business continuity planning, you generally have to take into account the RPO, or the recovery point objective. For example, you may not be able to recover every bit of data once you have an outage. Maybe you’re only doing backups once a day. So when everything comes back online, you only have the data up to the last daily backup, and you could lose as much as a day of data. But maybe you were doing backups every five minutes, and you would only lose the last five minutes of data. And of course, these have costs associated with them, so business decisions have to be made on what an acceptable RPO might be.
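To make the backup example concrete, here is a small Python sketch that works out how much data would be lost for a given outage and checks it against an RPO. The function name, the dates, and the RPO values are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

def data_loss_window(backup_times, outage_time):
    """Return the time between the most recent backup and the outage.

    Everything written during this window is lost when restoring from backup.
    """
    earlier = [t for t in backup_times if t <= outage_time]
    if not earlier:
        raise ValueError("no backup exists before the outage")
    return outage_time - max(earlier)

# Hypothetical schedule: one backup per day at midnight.
daily_backups = [datetime(2024, 3, day) for day in (1, 2, 3)]
outage = datetime(2024, 3, 2, 18, 30)

loss = data_loss_window(daily_backups, outage)
print("Data lost:", loss)
print("Meets a 24-hour RPO:", loss <= timedelta(hours=24))
print("Meets a 5-minute RPO:", loss <= timedelta(minutes=5))
```

In the worst case the outage happens just before the next backup, so the most you can lose is roughly one full backup interval. That is why the backup schedule has to be at least as frequent as the RPO you agree to.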
When we start calculating availability, it’s usually based on the uptime of an application or uptime of an infrastructure. And it’s almost always referred to as a percentage of uptime. For example, 99.999% availability. You sometimes hear that referred to as five nines of availability.
But just how much availability is the right amount? That number is going to depend on your particular circumstances. Maybe your organization can handle more downtime or less downtime depending on what your services might be. If you’re a hospital, you want to have a large amount of uptime. If you’re a manufacturer that is not working during the evening hours, you can have more downtime because you can do more maintenance during those off hours.
It can sometimes be a negotiated value because it ties into a bonus that you might get in your particular role. If you convert an uptime percentage into the actual amount of time, the results are pretty interesting. If you have availability of 99.9999%, that means that your actual downtime, usually measured over an entire year, is about 32 seconds. That’s a pretty aggressive availability percentage. For five nines, you can only be down for five minutes and 15 seconds during the entire year. And an availability of 99% means that you can be down for a maximum of 87 hours and 36 minutes over that entire time frame.
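The arithmetic behind those figures is simple enough to sketch in Python. This example assumes a 365-day year and rounds to the nearest second, and the function name is hypothetical.

```python
from datetime import timedelta

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year

def max_downtime(availability_percent, period_seconds=SECONDS_PER_YEAR):
    """Return the maximum downtime allowed in the period for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    downtime_fraction = (100 - availability_percent) / 100
    return timedelta(seconds=round(period_seconds * downtime_fraction))

for percent in (99.0, 99.999, 99.9999):
    print(percent, "% availability allows", max_downtime(percent), "of downtime per year")
```

Under those assumptions, the results are 87 hours and 36 minutes for 99%, 5 minutes and 15 seconds for five nines, and about 32 seconds for 99.9999%.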
So you can see the difference between 99% availability and five nines of availability is a very, very large amount. And it also means that you need a lot of redundancy and a lot of planning in place to be able to say that you have five nines of availability.