Performing a risk analysis can assist in understanding recovery options, single points of failure, and the impact on the organization. In this video, you’ll learn about functional recovery plans, redundancy, disaster recovery plans, and more.
Nobody likes downtime to impact the business, but it’s important to quantify exactly what type of outage you’re having, and when you might expect to get back up and running. One of these quantifications is a recovery time objective, or RTO. This describes how long it would take to get back up and running to a particular service level. This doesn’t necessarily mean that we’re looking at a complete recovery time, but we are looking to get to a certain point and needing to know how much time that will take.
Often this is used in conjunction with an RPO, or a recovery point objective. This means that we would set an objective for how much data needs to be available before we consider a system up and running. Part of the data may be available, but part of it may also be unavailable, and the question is, how much unavailable data is acceptable? In practice, that’s usually expressed as a point in time: if we bring the system back online, how far back will that data go? If the last usable copy of the data is from twelve hours ago, then everything since that point is lost, and the RPO tells us whether that loss is acceptable.
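To make the difference between the two objectives concrete, here is a minimal sketch that checks a single outage against an RTO and an RPO. It assumes the RPO is measured as the time between the last good backup and the start of the outage, and the RTO as the time from the start of the outage to the restored service level. The function name, field names, and timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup, outage_start, service_restored, rpo, rto):
    """Compare one outage against hypothetical RPO and RTO targets."""
    # Data written after the last backup is what we stand to lose.
    data_loss_window = outage_start - last_backup
    # Downtime runs from the outage until the agreed service level is back.
    downtime = service_restored - outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Illustrative values only: backup at 2:00, outage at 9:30, restored at 12:30.
result = meets_objectives(
    last_backup=datetime(2024, 1, 10, 2, 0),
    outage_start=datetime(2024, 1, 10, 9, 30),
    service_restored=datetime(2024, 1, 10, 12, 30),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=4),
)
print(result["rpo_met"], result["rto_met"])
```

In this made-up outage, the service comes back within the four-hour RTO, but the last backup is seven and a half hours old, so the four-hour RPO is missed. The two objectives are measured separately, and meeting one says nothing about the other.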
If you’re trying to understand how long it would normally take to make a particular repair, then you’re looking for the MTTR, or the mean time to repair. This is a good estimate to help you understand how long it will take to get back up and running if there is an outage.
And another important statistic is the mean time between failures, or the MTBF. This is an estimate of the average time between outages: if there is a failure, how long will it be before the next failure occurs? This is often useful so that you can plan to have the right resources in place at the right time to be able to resolve these types of problems.
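Both of these statistics are simple averages, so they are easy to work out from an outage log. The sketch below uses made-up repair times and uptime figures. It also includes the commonly used steady-state availability relationship, MTBF / (MTBF + MTTR), which is not covered in the discussion above and is added here only to show how the two numbers fit together.

```python
def mttr(repair_hours):
    """Mean time to repair: average duration of the recorded repairs."""
    return sum(repair_hours) / len(repair_hours)


def mtbf(total_uptime_hours, failure_count):
    """Mean time between failures: operating time divided by failure count."""
    return total_uptime_hours / failure_count


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical log: three failures, repaired in 2, 4, and 3 hours,
# with 2,991 hours of normal operation in between.
repairs = [2.0, 4.0, 3.0]
uptime = 2991.0

m_repair = mttr(repairs)
m_between = mtbf(uptime, len(repairs))
avail = availability(m_between, m_repair)

print(f"MTTR: {m_repair} h, MTBF: {m_between} h, availability: {avail:.3%}")
```

With these example numbers, the MTTR is 3 hours and the MTBF is 997 hours, which works out to roughly 99.7% availability.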
When an outage occurs, there needs to be a set of processes and procedures that can take us from the very beginning of resolving the issue all the way through to getting back up and running. This is a functional recovery plan, and it’s a step-by-step guide for going from an outage to being back up and running. There are a number of different items you would have in your functional recovery plan, but one of the more important ones is contact information for all the key players. You need to know who’s on call and who can address the particular problem that you happen to be having, and you need to have contact information for everyone involved so that you can keep everyone up to date.
Along with keeping the lines of communication open, we also need to understand the technical process we would go through to resolve this problem. This can be based on information provided in a knowledge base or we may already have a list of steps to follow to be able to get back up and running. And once we think we’re back up and running, it’s important to test the system to make sure that all of the things we assume are working properly really are working properly. So once we’ve tested the fix and we know it’s working as we expected, we can resume normal operation.
When you’re working with hardware, software, applications, and networks, any one device can bring down the entire system. To avoid this, we need to identify all single points of failure, and then find ways to remove those points of failure from the system. On the network side, we may add additional switches, or have redundant firewalls. In our facility, it might be useful to have a backup power system should we lose the main power. Or if an air conditioner breaks, it would be great to have a backup system that could also be used.
We also have to consider that people may be our single point of failure, especially if there’s a wide scale disaster and people are not able to get into the office. But even after looking at all of these single points of failure, there will be additional points of failure that you might want to address. At some point though, it becomes difficult, if not impractical to remove all single points of failure. Instead, you may have to work up to a certain point, and at that point, it becomes too costly to be able to remove the next single point of failure in your system.
A good example of redundancy that avoids those single points of failure is a network with multiple connections to the internet. There might be multiple routers in front of those connections, and there could be multiple firewalls. You could also have redundant routers on the inside of your network and redundant servers to pick up the load. All of these can work together. If you happen to lose a router, you’d be able to work through the secondary links and still maintain uptime and availability.
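One way to look for single points of failure in a design like this is to model the network as a graph, remove one device at a time, and check whether traffic can still get from the servers to the internet. The sketch below is a simple illustration of that idea, not a full network analysis. The device names and the topology are hypothetical.

```python
from collections import deque


def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal, skipping the removed device."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def single_points_of_failure(graph, start, goal):
    """Return every device whose loss alone cuts start off from goal."""
    candidates = [n for n in graph if n not in (start, goal)]
    return sorted(n for n in candidates if not reachable(graph, start, goal, n))


# Hypothetical topology: redundant firewalls and routers, but one core switch.
network = {
    "servers": ["core-switch"],
    "core-switch": ["servers", "firewall-a", "firewall-b"],
    "firewall-a": ["core-switch", "router-a"],
    "firewall-b": ["core-switch", "router-b"],
    "router-a": ["firewall-a", "internet"],
    "router-b": ["firewall-b", "internet"],
    "internet": ["router-a", "router-b"],
}

spofs = single_points_of_failure(network, "servers", "internet")
print(spofs)
```

In this example, losing either router or either firewall still leaves a working path, but the lone core switch is reported as a single point of failure. Adding a second switch with its own links would remove it from the list.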
Every organization should have a disaster recovery plan, or a DRP. This is a plan that provides you with a step by step guide for resuming operations after a disaster has occurred. This particular plan could cover a number of different scenarios. It might be that we lose a single application or maybe we lose the entire data center. It could be that a hurricane comes through and we lose an entire region. In these disaster situations, we need to have a plan that will help us get back up and running as quickly as possible.
The truly successful disaster recovery plans have extensive work that’s been done prior to the disaster. You want to, of course, have backups ready to go. There might be data replication that’s handled at an offsite location, there may be alternatives that you have in the cloud ready to go if you happen to declare a disaster, and you might have a secondary site where you can move all of your operations over to that different facility.
Many companies will work with a third party organization that specializes in disaster recovery. And they might have places around the world where you could set up a physical location and move your data center into their facility. They might also offer recovery services where they come on site and get you back up and running at your own facility.
Disasters can create significant impacts on an organization. The most important consideration, of course, would be the life of the employees of the organization. You need to make sure that the disaster recovery plans that you have take into account the well-being of your employees. We also have to consider the risks to our buildings and the assets inside of those buildings.
If this is a natural disaster, there may be flooding, there might be electrical issues, or the structure of the building itself may be unstable. If that’s the situation, then we might have a safety issue where no one’s able to work in that particular environment. All disasters have some type of impact to the finances of an organization, and an organization that’s been through a disaster may have difficulty providing services to their customers, and therefore, might have problems with their reputation.
If you’re putting together a disaster recovery plan, it’s important to know the mission essential functions for your organization. So let’s say there was a natural disaster, and a flood or a hurricane were to take out your entire headquarters building. What functions would be the most essential for your organization, and how would they be ranked? This is where we would start the analysis for disaster recovery, and we’ll determine step by step how we would get each one of those functions up and running. Once we get those business processes identified, we need to understand what technical resources we need to support those functions. The IT department should have documentation for each of those business functions, and if it doesn’t, it’s time to start creating that documentation before the disaster hits.
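The ranking exercise described here can be captured in a very small data structure. This sketch lists a few business functions with a priority and the technical resources that support them, then produces a recovery order and flags any function that has no documented resources yet. The function names, priorities, and resources are hypothetical examples and not a recommended ranking.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessFunction:
    name: str
    priority: int  # 1 = most essential to the organization
    resources: list = field(default_factory=list)  # supporting IT resources


def recovery_order(functions):
    """Names of the functions, most essential first."""
    return [f.name for f in sorted(functions, key=lambda f: f.priority)]


def undocumented(functions):
    """Names of functions with no technical resources documented yet."""
    return [f.name for f in functions if not f.resources]


# Hypothetical inventory for illustration only.
functions = [
    BusinessFunction("Payroll", 3, ["HR database"]),
    BusinessFunction("Order processing", 1, ["Order app", "Payment gateway"]),
    BusinessFunction("Customer support", 2),
]

order = recovery_order(functions)
gaps = undocumented(functions)
print(order)
print(gaps)
```

Here the recovery order starts with order processing, and customer support shows up as a gap because nothing has been documented for it. That is the kind of missing documentation to fix before a disaster hits.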
The location of the disaster is also an important consideration. If you’re a manufacturing company, you might have a headquarters building and you might have a warehouse. Both of those have different types of risks associated with them, so you need to understand the applications required for those environments, the personnel you would need to get up and running at that facility, the equipment that you’d need to get back up and running, and the work environment that you would be in either at that location or a secondary location.