Resiliency – CompTIA Security+ SY0-701 – 3.4

It can be challenging to maintain uptime and availability of our modern networks. In this video, you’ll learn about server clustering, load balancing, site resiliency, multi-cloud systems, and more.


If you work in information security, then you’re very focused on maintaining the uptime and availability of systems. And one way to provide that level of resiliency is through the use of high availability. To plan for outages, some administrators will purchase multiple components so that if one component fails, they can easily bring in the other component to replace it.

But that doesn’t necessarily mean that everything remains up and running. It may take time to pull the secondary device out of the box, put it into the rack, and configure it before you’re finally up and running. Instead, we can provide enhanced availability with HA, or High Availability. In this configuration, everything would be up and running, always turned on, and always available.

If one system fails, you have another system running at the same time right beside it. And that other system will take up the additional load and continue to provide availability for that particular service. But of course, having additional components and having those systems always turned on and always operating means that there will also be additional costs.
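The failover behavior described above can be sketched in a few lines. This is a minimal illustration, not a real HA implementation; the system names and the simulated heartbeat results are hypothetical.

```python
def choose_active(systems, healthy):
    """Pick the first system whose heartbeat check succeeds.

    In an HA pair, both systems are always on. When the active
    node's heartbeat fails, the standby takes over immediately --
    no unboxing, racking, or configuring a spare.
    """
    for name in systems:
        if healthy(name):
            return name
    raise RuntimeError("no healthy system available")

# Simulated heartbeat results: "sys-a" has failed, "sys-b" is still up.
status = {"sys-a": False, "sys-b": True}
print(choose_active(["sys-a", "sys-b"], status.get))  # sys-b
```

A real deployment would probe the heartbeat over the network on a timer, but the decision logic is the same: the standby is already running, so taking over is instant.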

When we engineer for high availability, not only do we purchase multiple systems, but we may also need upgraded power for all of those systems, and the components within them may be of higher quality. Because of this, everything will have a higher overall cost.

Another way to provide this resiliency is through server clustering. This is when you can have multiple servers configured to all work together as one big server. And for the folks using these resources, they don’t see individual servers. They simply see one single server cluster.

This clustering also provides some scalability options because you can add and remove devices from the cluster in real time, either increasing or decreasing the capacity as needed. This clustering functionality is generally turned on within the operating system of the server itself. And usually, all of the servers are running identical operating systems to provide that interoperability.

One common design for a server cluster has users communicating through a switch, with that switch connected to the entire server cluster. To maintain synchronization between all of these servers, they don’t write information to their local drives. Instead, very commonly, there’s separate shared storage, and all of the servers use that single shared storage so that they’re always referencing up-to-date data.
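The shared-storage idea can be sketched as follows. This is a toy model, not cluster software: the class names and the in-memory dictionary stand in for a real shared storage array that every node mounts.

```python
class SharedStorage:
    """Stands in for the shared storage array all cluster nodes mount."""
    def __init__(self):
        self.data = {}

class ClusterNode:
    """A cluster node writes to shared storage, never to a local disk,
    so every node always sees up-to-date data."""
    def __init__(self, name, storage):
        self.name = name
        self.storage = storage  # every node holds the SAME storage object

    def write(self, key, value):
        self.storage.data[key] = value

    def read(self, key):
        return self.storage.data[key]

storage = SharedStorage()
node_a = ClusterNode("node-a", storage)
node_b = ClusterNode("node-b", storage)
node_a.write("session:42", "active")
print(node_b.read("session:42"))  # node-b sees node-a's write: active
```

Because both nodes reference the same storage, a write from one node is immediately visible to the other, which is exactly why nodes can be added or removed from the cluster without losing data.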

A resiliency method very similar to server clustering is load balancing. Load balancing uses one central load balancer to distribute the load between individual servers. Unlike server clustering, where each individual server in the cluster knows of all of the other servers in the cluster, load balancing works very differently because each of these servers has no idea that the other servers even exist.

The load balancer acts as that central point that manages what devices receive what requests. So it will distribute the load across all of the individual multiple devices. These devices could be running the same operating system or a different operating system. It doesn’t matter because the load balancer is the one maintaining and managing the load across all of those systems.

And similar to server clustering, you can add and remove devices from the load balancer as you need. So if you need additional capacity on your network, you simply add additional servers to the load balancer. This also works if a server fails. The load balancer will automatically identify a server that is no longer working, administratively remove it from the load balancer, and then simply spread the load across the remaining servers.
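The add, remove, and round-robin behavior described above can be sketched like this. It’s a minimal illustration with hypothetical server names; real load balancers add health probes, weighting, and session persistence on top of this idea.

```python
class LoadBalancer:
    """A minimal round-robin load balancer sketch.

    Backend servers don't know about each other; the balancer
    alone decides which server receives each request.
    """

    def __init__(self, servers):
        self.servers = list(servers)  # pool can grow or shrink at runtime

    def add_server(self, server):
        """Add capacity by registering another backend."""
        self.servers.append(server)

    def mark_failed(self, server):
        """Administratively remove a server that failed its health check;
        the remaining servers absorb the load."""
        self.servers.remove(server)

    def next_server(self):
        """Rotate through the pool (simple round-robin)."""
        server = self.servers.pop(0)
        self.servers.append(server)
        return server

lb = LoadBalancer(["web1", "web2", "web3"])
print(lb.next_server())  # web1
lb.mark_failed("web2")   # health check failed; web2 removed from the pool
print(lb.next_server())  # web3
```

Note the contrast with the cluster sketch: here the backends share nothing, so they could even run different operating systems.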

We can also spread this resiliency out to physical locations. This is site resiliency. And you might have a recovery site that has already been allocated in case of a disaster. All of your data is synchronized at that site. And you’re simply waiting for a problem to occur.

When that disaster is called, a business could move their entire data center to a completely different location that would be physically separated from the disaster. This allows all of your normal operations to continue for the duration of that particular event. You might only switch over to the recovery site for a number of hours. Or you may be using that recovery site for weeks at a time. Once the disaster event is over, you can then move from the recovery site back to your original data center.

One type of recovery site is a hot site. A hot site is an exact replica of your data center. It contains all of the hardware that you’re running in your data center, which also means, of course, that every time you purchase for your existing data center, you also purchase for the hot site.

We also keep all of our applications updated at the hot site. And all of our data is constantly synchronized to the hot site. This means that any time we need to move all of our resources to the hot site, it will contain an exact duplicate of our existing data center.

For some organizations, a disaster may be a relatively rare circumstance. So they may not require a hot site but might instead consider a cold site. A cold site is an empty building. There’s, of course, power and lighting. But you’ll need to bring all of your data so that you have something you can reference, all of your equipment, and, of course, all the people required to run this particular site.

And for some organizations, a better fit might be a warm site. This is a midrange between a cold site and a hot site, where there is some equipment on site. And perhaps some of the data is available. But you’ll need to bring additional hardware and additional data to cover anything that may not be available.

For most organizations, their recovery site is going to be at a location that is a significant distance away. This geographical dispersion means that if something physical happens to your primary location, it’s very likely your recovery site will not be affected. For example, let’s say there is a storm that affects a very large area such as a hurricane or a flood. That would certainly affect a recovery site that was located down the street from your main location.

But if your recovery site was in a different state, it’s very unlikely that that particular storm would affect both locations at the same time. As you can imagine, it becomes more challenging to have a recovery site that is farther and farther away. We have to think about how we’re going to get equipment from our location to the recovery site.

There will probably need to be employees that are on site at the recovery site. And we need to consider how we’re going to transport them, especially if there’s been a natural disaster. And of course, eventually, all of that needs to come back to the main site if this is something that will only be used temporarily.

Another type of resiliency is platform diversity. We know that operating systems have inherent vulnerabilities that are located in the OS. Some of those vulnerabilities are known, and we can patch them. But other vulnerabilities simply haven’t been discovered yet.

However, it’s very common that these vulnerabilities are specific to a type of operating system. For example, if you have vulnerabilities that have been identified in the Windows operating system, it’s unlikely that those same vulnerabilities exist for Linux or macOS. So to minimize the potential for one single vulnerability causing a problem for all of your systems, it might make sense to have different operating systems used for different purposes.

We might use Linux and Windows in our data center and then have macOS and Windows devices on our clients. This allows us to spread that risk around and perhaps could limit any exposure to one single vulnerability.

We can also provide resilience in the cloud itself. Of course, there’s not just one single cloud provider. We might be using a cloud that’s from Amazon Web Services, Microsoft Azure, or Google Cloud. All of these cloud providers have their own processes, their own procedures. And of course, if there’s an outage with one cloud provider, that generally does not affect other providers.

Ideally, if there is an outage with one provider, we might have similar services available on a separate provider. We could provide application services from multiple cloud providers and have data in different locations as well. This also is important from a security perspective. If there is a security concern with one provider, having our information with different cloud providers may provide us with uptime and availability that wouldn’t be there if we were only using a single provider.
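The multi-cloud failover pattern described above can be sketched in a few lines. This is a hypothetical example: the provider names match the vendors mentioned, but `fetch_report` stands in for whatever per-provider API call your application actually makes.

```python
class CloudOutage(Exception):
    """Raised when a provider can't serve the request."""

def fetch_report(provider):
    """Hypothetical call to one provider's service.

    Simulates an outage at the primary provider ("aws") so the
    failover path below has something to route around.
    """
    if provider == "aws":
        raise CloudOutage("primary provider unavailable")
    return f"report served by {provider}"

def fetch_with_failover(providers):
    """Try each provider in order; an outage at one provider
    doesn't take down the service as long as another responds."""
    for provider in providers:
        try:
            return fetch_report(provider)
        except CloudOutage:
            continue  # this provider is down; try the next one
    raise CloudOutage("all providers unavailable")

print(fetch_with_failover(["aws", "azure", "gcp"]))
# report served by azure
```

The same ordering idea applies to data: keeping copies with more than one provider means an outage, or a security incident, at one of them doesn’t make the data unavailable.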

We’ve talked about disasters and outages with the idea that we’d have those services available elsewhere. But what if there are no available technologies elsewhere? We would then have to fall back to some other method. We refer to this as Continuity of Operations Planning, or COOP.

We’ve almost become too accustomed to our technology. And when technology is not available, we need to find some nontechnical way to still provide the same services. For example, you could have a manual process for completing transactions and having people sign off on them physically. Or you may be providing paper receipts instead of the automated receipts that might come from your point-of-sale system.

And if your automated credit card transaction approvals aren’t working, there may be an option to call that particular card in to see if you can get approval over the phone. These are obviously processes and procedures that need to be thought through and designed well before they’re needed, so that when an issue or disaster suddenly arises, we know exactly what to do to keep all of our services up and running.