A disaster can be a small event or a large one, and you need to be ready for anything. In this video, you’ll learn about disaster recovery planning and some strategies for dealing with disaster events.
When we hear the words disaster recovery, we often think about the big problems: the hurricanes, the fires, the tornadoes. The reality is there could certainly be big problems, but very often there are smaller problems as well. If a water pipe bursts in your facility and damages systems in your data center, that becomes a bit of a security problem, and it’s also a disaster that has to be recovered from.
We will often manage these disaster recovery systems through a third party, or at least include a third party in the process. Maybe we contract with a data center facility that sits on standby. If we ever call a disaster, that data center will be available to us, and we can drive there. It’s probably in a geographically diverse location, so if something happens over a large geographical area, we can go to a different city and bring up a disaster recovery data center there.
Generally, we call a disaster. That means we formally declare that a disaster has occurred. And there’s a set of processes and procedures for calling that disaster, because when you start the disaster recovery process, there are costs associated with it. We’re dispatching people, for example.
At that point everything goes into action. We look at our plan of attack. We’ve gone through all the things that we’ve been planning for, and now we’re actually executing on it. We have to be able to think on our feet, because when a disaster occurs, things might happen that we would have never thought of. We may have realized that we’re going to need to have generators if we lose power, but we may not have realized that the power outage would be so massive that we would not be able to get gasoline for the generators because the gas stations don’t have power to be able to pump the gasoline.
So all of these things work together, and sometimes you have to ask, how am I going to get gasoline? How am I going to be able to pump it? Sometimes those unknown things can really bite you. And you have to be able to move and change as you go, especially in the middle of a disaster.
If you’re going to recover from a situation, you’re going to want to test prior to that. You don’t want the real incident to be your first time going through the process. And so you’re going to plan, and you’re going to test for this. Unless you try it, you’ll never know what you missed or what you need to add or modify in the plan.
Many people will schedule these tests. Some do them once a month. Some do them once a year. Either way, they go through the process of understanding how their processes work. Are they as effective and efficient as we expect them to be, or as we need them to be?
So we’ll create a scenario. Let’s say we lose the building. Let’s say this server goes down. Let’s say a database crashes. What do you do?
And we go through the process of obtaining new hardware, grabbing the backups, loading them on that hardware, maybe finding a new building with a new internet connection, maybe failing over to our redundant facility. We want to include as much of the organization as possible during these tests, but we don’t want to affect any of the production systems that are currently running. You obviously would not fail a real server that’s being used on your production network to provide services to your end users.
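One way to practice the “grab the backups and load them somewhere else” step without touching production is to restore into a throwaway location and check that the copy is intact. The following is only a minimal sketch of that idea, not a full restore procedure. The function names `sha256_of` and `restore_test`, and the idea that you have a known-good checksum recorded for each backup, are assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_test(backup_file: Path, expected_sha256: str) -> bool:
    """Copy a backup into a scratch directory and verify its checksum.

    The original backup is only read, never modified, and the scratch
    directory is deleted when the check finishes.
    """
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup_file.name
        shutil.copy2(backup_file, restored)
        return sha256_of(restored) == expected_sha256
```

A matching checksum only tells us the file copied cleanly. A real test would go further and actually load the backup on the recovery hardware to confirm the service comes up.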
You also want to think about documenting this, especially during the testing phase. Should an actual emergency occur and cause downtime for your business, that’s not the time to stop and wonder whether you’re doing this right or what you should change for next time. You want to do that during your planning phase and during your testing phase.
And afterwards you’ll be able to look at that list and ask, what worked during our test? What did not work during our test? Do we need to change our processes? Do we need to buy additional resources, or do we need to think differently about how we’re going to handle this problem should it occur?
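The list of what worked and what didn’t can be as simple as a spreadsheet, but here is a small sketch of the same idea in code. Everything in it is hypothetical: the class names `StepResult` and `DrTestReport`, the scenario, and the sample steps are made up to show the shape of the record, not taken from any particular tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepResult:
    step: str         # what we tried during the test
    worked: bool      # did it go as the plan expected?
    note: str = ""    # what we would change for next time


@dataclass
class DrTestReport:
    scenario: str
    results: List[StepResult] = field(default_factory=list)

    def record(self, step: str, worked: bool, note: str = "") -> None:
        """Add one step of the test to the report."""
        self.results.append(StepResult(step, worked, note))

    def failures(self) -> List[StepResult]:
        """Return only the steps that did not work."""
        return [r for r in self.results if not r.worked]


# Hypothetical test run for the "database crashes" scenario.
report = DrTestReport("Database crash")
report.record("Locate latest backup", True)
report.record("Restore to spare hardware", False, "No spare server available")
report.record("Point application at restored database", True)

for item in report.failures():
    print(f"{item.step}: {item.note}")
```

The failed steps are the ones that feed the questions above: whether the process needs to change, or whether we need to buy additional resources.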