Keeping the network running is a primary objective of any network administrator. In this video, you’ll learn about high availability, redundancy options, NIC teaming, and other methods to maintain uptime and availability.
One of the primary goals on our enterprise networks is to maintain the uptime and availability of the network and the applications. To be able to do that, we often need to implement some type of fault tolerance. Fault tolerance allows us to keep everything running in the case of a failure. Sometimes this fault tolerance happens behind the scenes and no one knows it’s happening. But sometimes, the systems that we’re using for fault tolerance may cause an outage or degrade performance.
In almost all cases, however, fault tolerance adds additional complexity, both to the network design and to the processes and procedures. This also, of course, increases cost, since you’ll need some type of alternative should your primary systems fail. Sometimes this fault tolerance can exist in a single system. For example, you might have a RAID array, so that if you lose one drive, an additional drive is there to maintain the uptime and availability of the data.
Maybe in a single server, you have redundant power supplies. So if you lose one power supply, you have a second supply that you could use instead. Or perhaps, you have redundant network interface cards, so that if you lose one network connection, you always have another connection to rely on.
You can even combine this with multiple systems all working together. For example, you might have a server farm that has load balancing with multiple servers behind the load balancer. If any one of those servers fails, the system continues to run because the load balancer will redirect that traffic to a server that’s still operating. Or you may implement multiple links to the internet, and if one particular internet provider fails, you still have a backup connection to provide that connectivity.
When designing these fault tolerant systems, we often consider redundancy as an easy way to provide that fault tolerance. This means that you would have multiple components, and if one component were to fail, you could use the other component instead. For example, as we mentioned earlier, a server may have multiple power supplies. If one of those power supplies were to fail, the other one continues to operate and the server continues to be available.
We can have the same redundancy through the Redundant Array of Independent Disks, or RAID. By using certain RAID configurations, we can lose a drive, but still maintain the uptime and availability because we have other drives and other redundancy methods built into the RAID array. If you live in an area where there are power outages, then an Uninterruptible Power Supply, or UPS, is a great way to provide fault tolerance.
You might also want to cluster together a number of different servers, so that you have a way to maintain uptime if any one of those servers were to fail. And as we mentioned earlier, having a load balancer not only allows us to share the load across multiple servers, but if one server were to fail, we could take that server out of the list and have all of the other servers take up the extra load.
Let’s build a network configuration that includes some fault tolerant connections. In this diagram, we’ve got a single internet connection connecting to a single firewall. That firewall then connects to a router, which then connects to a switch, and then, finally, to a web server. If we happen to lose any of these components, the entire connection is now unavailable.
For example, if we lose this firewall, we’re no longer able to communicate between the internet provider and the router. With a fault tolerant configuration, we might have a spare firewall that we can power on and use to replace the existing firewall. And now, everything is back up and running because we had a fault tolerant system that we could use to replace the faulty unit.
In our previous example, we had a fault tolerant firewall, but we had to power on that firewall and connect it to the network to provide the redundancy. It would be much more advantageous if the secondary firewall were already in place, the systems could automatically recognize that the first firewall had failed, and we could immediately begin using that redundant system. We refer to this nearly instantaneous switch between a failed system and one that’s working as high availability.
High availability means that the systems are generally always on and always available. And if we have an outage with one system, we can very quickly switch over to another system, usually so quickly that the users have no idea that a failure has occurred. In some cases, this high availability involves a number of different systems working together.
In our previous example, we could have had that secondary firewall already in the rack, already running, and already monitoring the system. And if that second firewall notices that the first firewall has failed, it can automatically take over to provide connectivity.
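The automatic takeover described here can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Firewall` and `FirewallPair` classes, the `healthy` flag, and the names `fw1` and `fw2` are all hypothetical, and a real firewall pair would use its vendor's own failover protocol rather than anything like this.

```python
from dataclasses import dataclass


@dataclass
class Firewall:
    name: str
    healthy: bool = True  # hypothetical health flag, set by some monitor


class FirewallPair:
    """Hypothetical active/standby pair: the standby is already running."""

    def __init__(self, primary: Firewall, secondary: Firewall) -> None:
        self.active = primary
        self.standby = secondary

    def check(self) -> Firewall:
        # If the active unit has failed and the standby is healthy,
        # swap roles so traffic immediately uses the working firewall.
        if not self.active.healthy and self.standby.healthy:
            self.active, self.standby = self.standby, self.active
        return self.active


pair = FirewallPair(Firewall("fw1"), Firewall("fw2"))
pair.active.healthy = False   # simulate a failure of the first firewall
print(pair.check().name)      # fw2
```

The point of the sketch is that no one has to power anything on or move cables: the second unit is already in the rack, and the role swap happens as soon as the failure is noticed.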
And again, we have to consider that having this highly available system also means that we’re going to be purchasing multiple systems, and there will be a higher cost associated with that. So whether we’re adding additional power supplies to a server, increasing the number of components in a server farm, or adding an additional firewall, there will be a cost for implementing that into the network.
Let’s see how high availability would work on a system with a load balancer. This is a very common configuration, where you might have an internet connection with different users on the internet that need to access web servers that are on the other side of a load balancer. The load balancer has a Virtual IP, or VIP, on the outside, so all of the users are connecting to this single virtual IP.
Behind the scenes, of course, each one of these servers has its own IP address that’s only known to the load balancer. And in this example, server A and server B are currently active because those have the green lights. But we also have a server C and server D that are on the network and powered up, but not currently being used by the load balancer.
If we happen to have a failure of one of these initial servers, users that would normally connect to a particular server would find that server is no longer available. If server A fails, the load balancer will recognize that server is no longer responding to any type of queries, and it will begin using an additional server in its place. All users that were normally connecting to server A would now connect to server C.
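Here is a small Python sketch of that behavior, with servers A and B active and servers C and D waiting in reserve. The `LoadBalancer` class, its method names, the round-robin selection, and the `203.0.113.10` address (taken from a documentation-only range) are assumptions made for illustration, not how any particular load balancer is implemented.

```python
class LoadBalancer:
    """Toy load balancer: one virtual IP in front of active and standby servers."""

    def __init__(self, vip, active, standby):
        self.vip = vip                  # the single address users connect to
        self.active = list(active)      # servers currently receiving traffic
        self.standby = list(standby)    # powered up, but not yet in use
        self._turn = 0

    def mark_failed(self, server):
        # Called when a server stops responding to the load balancer's queries.
        if server not in self.active:
            return
        slot = self.active.index(server)
        if self.standby:
            # Put the next standby server in the failed server's place.
            self.active[slot] = self.standby.pop(0)
        else:
            del self.active[slot]

    def route(self):
        # Simple round-robin over whichever servers are active right now.
        if not self.active:
            raise RuntimeError("no servers available behind the VIP")
        server = self.active[self._turn % len(self.active)]
        self._turn += 1
        return server


lb = LoadBalancer("203.0.113.10", active=["A", "B"], standby=["C", "D"])
lb.mark_failed("A")
print(lb.active)   # ['C', 'B']
print(lb.standby)  # ['D']
```

Users keep connecting to the same virtual IP the whole time. Only the list behind the load balancer changes, which is why the failure of server A can go unnoticed.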
Let’s take that original network design and build it out to be highly available. We may want to add an additional firewall and have both of those firewalls on the network simultaneously, so that if you lose one firewall, the other one can take up the slack. You might also want to have multiple routers for exactly the same reason.
If a router fails, we need a secondary router to be able to provide connectivity. The same thing would apply to our switches. And, indeed, on the back end, we may even want to consider using multiple web servers.
We may want to include a load balancer in the middle, so that we can have multiple web servers and have high availability that’s managed by the load balancer itself. We might even want to consider adding an additional internet provider, so that if one internet provider is not available, we have a secondary system to rely on. As you can see, this can very quickly increase the complexity of the network design and the costs associated with these implementations.
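To see why duplicating each component matters, here is a rough Python sketch that compares the original single-path design with a built-out version. The `path_is_up` function and the True/False status lists are hypothetical, and the model is deliberately simple: a layer works if at least one of its units works, and the whole path works only if every layer does.

```python
def path_is_up(layers: dict[str, list[bool]]) -> bool:
    """Return True if every layer has at least one working unit."""
    return all(any(units) for units in layers.values())


# Original design: one of everything, and the firewall has failed.
single_path = {
    "internet": [True],
    "firewall": [False],
    "router": [True],
    "switch": [True],
    "web server": [True],
}

# Built-out design: the same failure, but every layer has a second working unit.
redundant_path = {name: units + [True] for name, units in single_path.items()}

print(path_is_up(single_path))     # False
print(path_is_up(redundant_path))  # True
```

The second dictionary has twice as many entries to buy, configure, and maintain, which is the added complexity and cost in a nutshell.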
Another method of providing fault tolerance and high availability in the data center is to use NIC teaming, which you might also see referred to as LBFO, or Load Balancing/Fail Over. This allows you to have multiple links between devices. These multiple links will not only aggregate the bandwidth between those devices, but if you lose one of those connections, the other connection is there for high availability. This can be very useful if you’re building redundancy between different cloud services or different physical or virtual systems.
This is usually implemented by having multiple physical network interface cards in a device, but all of those cards are bound together to look like one single very large interface. This is usually integrated with the switches that you’re using, so that they can properly forward the traffic to the correct network interface card.
These network interface cards are constantly sending hello messages to each other and responding to those hello messages. If an interface card stops responding, then we know that interface card may not be available, and we can take that card offline and use the remaining network interface cards instead.
With multiple network interface cards in a server, you have a number of different implementations for high availability. You could use port aggregation, where you have multiple links to a single switch. This means that instead of having a single 1 gigabit connection to a switch, you could have two separate links and use the full 2 gigabits, effectively doubling the bandwidth to and from the server.
This still provides high availability because if you lose one of these connections, or someone pulls the wire out of one of these network interface cards, the other interface is available to provide connectivity. You could take this a step further by providing multipathing, where you have multiple network interface cards in the server, but instead of connecting to a single switch, you connect each NIC to its own switch. That way, if an entire switch were to fail, you would still have connectivity to the rest of the network.
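The difference between port aggregation and multipathing shows up clearly if we compare what survives each kind of failure. In this Python sketch, the `Link` class, the `surviving_gbps` function, and the names `nic0`, `nic1`, `sw1`, and `sw2` are hypothetical. It simply adds up the capacity of the links that are still up. Whether links to two different switches can really be used as one combined pipe depends on the equipment, so treat the totals as an upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    nic: str      # interface card in the server
    switch: str   # switch on the other end of the cable
    gbps: float   # link speed in gigabits per second


def surviving_gbps(links, failed_nics=frozenset(), failed_switches=frozenset()):
    """Total capacity of links whose NIC and switch are both still working."""
    return sum(
        (link.gbps for link in links
         if link.nic not in failed_nics and link.switch not in failed_switches),
        0.0,
    )


# Port aggregation: both NICs connect to the same switch.
aggregated = [Link("nic0", "sw1", 1.0), Link("nic1", "sw1", 1.0)]
# Multipathing: each NIC connects to its own switch.
multipath = [Link("nic0", "sw1", 1.0), Link("nic1", "sw2", 1.0)]

print(surviving_gbps(aggregated))                           # 2.0
print(surviving_gbps(aggregated, failed_nics={"nic0"}))     # 1.0
print(surviving_gbps(aggregated, failed_switches={"sw1"}))  # 0.0
print(surviving_gbps(multipath, failed_switches={"sw1"}))   # 1.0
```

Both designs survive the loss of a single card or cable, but only the multipath design keeps the server reachable when a whole switch goes down.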