A disaster recovery plan isn’t very useful if it doesn’t work. In this video, you’ll learn how organizations test their recovery plans before an actual disaster occurs.
If an organization has created a disaster recovery plan, then they also need to test that they can actually execute that plan. Recovery testing is usually done on a regular basis so that everyone understands the processes and procedures that take place if an actual disaster is declared. This recovery testing generally has a very specific scope, and we want to be sure that nothing we do will affect the actual production systems.
Usually there is a specific scenario that we would need to run through to recover, and we would have a time frame already set aside to perform the testing. And once the test is done, we'll need to evaluate how we performed for this particular scenario and then make any changes to our recovery plan before the next test.
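To make this concrete, here's a rough sketch in Python of how one of these tests might be tracked as structured data. This is purely illustrative; the field names and values are assumptions, not any standard format.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class RecoveryTest:
    """Hypothetical record for one disaster recovery test."""
    scenario: str                 # the specific disaster being simulated
    scope: list[str]              # systems included; production stays untouched
    window_start: date            # time frame set aside for the test
    window_end: date
    findings: list[str] = field(default_factory=list)  # gaps found during evaluation

# Example: plan a test of a data center power loss scenario
test = RecoveryTest(
    scenario="Primary data center power loss",
    scope=["test-web-01", "test-db-01"],   # isolated test systems only
    window_start=date(2024, 6, 1),
    window_end=date(2024, 6, 2),
)
test.findings.append("Backup restore took longer than documented")
```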
The problem with recovery testing is that it can be relatively involved and expensive to move everyone to a recovery site and build out a complete infrastructure only for a test. One way to minimize these costs is to talk through the steps as if you were actually performing them. This is referred to as a tabletop exercise, and it allows us to walk through the recovery steps we've already outlined with a group of other people around a table.
This way, we can work through all of the steps of our disaster recovery plan and work with other people in the organization to make sure that what we're doing fits with their recovery plans as well. And while this is not an actual recovery, it does allow you to examine the logistics you've set aside for any real recovery exercise. And if there are any shortcomings or anything you've missed in your existing recovery plans, you can often identify those during the tabletop exercise.
Another good recovery test is a failover test. We would use this test to see if our redundant configurations are able to switch over if there happens to be a failure. In an ideal failover test, the failover would happen automatically, users would be redirected behind the scenes to the appropriate resources, and they would have no idea that they were now running on the backup systems.
For this to work properly, we would need redundant systems that can perform the failover. These could be multiple switches, firewalls, routers, or anything else that might fail in your infrastructure. Many devices are already configured with the ability to fail over. There are routers, firewalls, and switches with failover functionality built into their software. We might also be able to use protocols or processes, such as VRRP or HSRP, that provide some level of failover.
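As a minimal sketch of the idea at the application level, the Python below probes a primary endpoint and falls back to a secondary one if it doesn't respond. The hostnames are placeholders, and real network gear would typically handle this with redundancy protocols rather than a script like this.

```python
import socket

# Hypothetical endpoints; in practice these would be your redundant devices.
ENDPOINTS = [("primary.example.com", 443), ("secondary.example.com", 443)]

def first_healthy(endpoints, timeout=2.0):
    """Return the first endpoint that accepts a TCP connection."""
    for host, port in endpoints:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host, port          # this endpoint is reachable
        except OSError:
            continue                       # try the next redundant path
    raise RuntimeError("All redundant endpoints are down")

host, port = first_healthy(ENDPOINTS)
print(f"Routing traffic to {host}:{port}")
```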
Here’s a diagram that describes how failover could be designed in an infrastructure. We might have multiple internet connections from different providers. Those are connected to different routers inside of our organization, which in turn connect to redundant firewalls. We would also have multiple switches, and on the back end we might have multiple links from the individual servers.
If we do have a break in the link on our primary connection because a switch, a firewall, or a router fails, we’ve always got a secondary link that we can use to get out to the internet. We could even extend this failover design by adding load balancers with multiple servers behind each of these individual connections.
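Here's a rough sketch of how that load balancing might behave: requests rotate across a pool of servers, and a server that fails a health check is skipped transparently. The server names are made up for illustration.

```python
import itertools

class RoundRobinBalancer:
    """Toy round-robin load balancer that skips servers marked unhealthy."""
    def __init__(self, servers):
        self.healthy = dict.fromkeys(servers, True)
        self._cycle = itertools.cycle(servers)

    def mark_down(self, server):
        self.healthy[server] = False       # e.g., after a failed health check

    def pick(self):
        for _ in range(len(self.healthy)):
            server = next(self._cycle)
            if self.healthy[server]:
                return server              # next healthy server in rotation
        raise RuntimeError("No healthy servers remain")

lb = RoundRobinBalancer(["web-01", "web-02", "web-03"])  # hypothetical pool
lb.mark_down("web-02")                     # simulate a server failure
print([lb.pick() for _ in range(4)])       # web-02 is skipped transparently
```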
We might also want to test our security with simulations. This might be a simulation of a phishing attack, a password reset process, or an attempt to remove data from the network to see if it’s identified by our automated systems. A simulation that many users are accustomed to is the phishing simulation. We spend a lot of time training our users not to click links inside of email messages.
But there’s only one way to tell if they would click a link if it was presented to them, and that’s by performing a phishing simulation. We would craft a phishing email that’s likely to be interesting to our users, and we would send that message to everyone in the company. Then we would sit back, watch our monitoring systems, and see if anyone clicks the links inside of that email message.
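One way the tracking side of a simulation like this can work is to embed a unique token in each user's link, so any click maps back to the person who clicked. Here's a hedged Python sketch; the addresses, domain, and function names are invented for illustration.

```python
import uuid

# Hypothetical recipient list for the simulated phishing campaign.
users = ["alice@example.com", "bob@example.com"]

# Give every user a unique token so a click identifies who clicked.
tokens = {uuid.uuid4().hex: user for user in users}

def phishing_link(token):
    # Placeholder domain controlled by the security team, not an attacker.
    return f"https://training.example.com/click?t={token}"

for token, user in tokens.items():
    print(f"Send to {user}: {phishing_link(token)}")

def record_click(token):
    """Called by the landing page; maps the token back to the user."""
    user = tokens.get(token)
    if user:
        print(f"{user} clicked the simulated phishing link")
    return user
```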
The first test is to see if our internal automated systems can even detect that this is a phishing email. And if that email does make it to the users, we can see who clicked on it, and perhaps assign additional training to those users.

Another way to provide additional recovery options is parallel processing. This allows us to have multiple CPUs or processes that can be used simultaneously to process transactions.
This could be a single device that has multiple cores inside of it, or we may be using an infrastructure with multiple computers to provide this parallel processing. So instead of having a single CPU that’s handling a large number of complex transactions, we can spread those transactions across multiple CPUs. And that would allow us to complete those transactions much more efficiently.
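Here's a minimal sketch of that idea using Python's multiprocessing module. The process_transaction function is just a stand-in for real work, and the pool size and batch are arbitrary.

```python
from multiprocessing import Pool

def process_transaction(txn_id):
    """Stand-in for a complex transaction; real work would go here."""
    return txn_id, sum(i * i for i in range(100_000))  # simulated CPU load

if __name__ == "__main__":
    transactions = range(16)               # hypothetical batch of transactions
    with Pool(processes=4) as pool:        # spread the work across 4 CPUs
        results = pool.map(process_transaction, transactions)
    print(f"Completed {len(results)} transactions in parallel")
```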
This also provides a level of resiliency. If one of those processors happens to become unavailable, we can still spread the load across the remaining processors in our parallel processing infrastructure.