Is the network slow? In this video, you’ll learn about some of the most important metrics to monitor to ensure the uptime and availability of the network and applications.
On your network, you’ve probably installed routers and switches and firewalls and other infrastructure devices. But how do you know if those devices are performing well? There are a number of performance metrics that can give you information about the health and availability of these devices. For example, we can start by looking at the temperature inside of that device. Many of these infrastructure components have temperature sensors inside of them.
And some of them have multiple temperature sensors, so that you can see the difference between the CPU temperature as compared to the temperature of the air that’s passing through the system. If you start to notice a trend, where the temperature seems to be rising or it seems to be going higher than it normally would be, this might indicate some type of hardware problem or issue with the software on the system. This is something that is important to monitor over time so that you can see these differences in the trend.
It’s nice to be able to compare what’s going on today with what the temperature might have been a week or a month ago. Another important metric is the utilization of the central processing unit, or CPU. This shows how much work the device is doing. If you notice that CPU utilization is increasing, it could be that this device is under a heavier load. This is often the first metric we look at because it gives us a broad view of how this particular device might be performing.
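As a rough illustration of comparing today's readings with older ones, here is a minimal Python sketch that could apply to temperature or CPU utilization. The function name `rising_trend`, the `window` size, and the `tolerance` value are assumptions chosen for the example, not settings from any particular monitoring product.

```python
from statistics import mean


def rising_trend(samples, window, tolerance):
    """Return True if the newest readings average noticeably higher than the oldest.

    samples   -- readings in time order (for example, degrees or CPU percent)
    window    -- how many readings to average at each end (assumed value)
    tolerance -- how large an increase counts as a trend (assumed value)
    """
    if len(samples) < 2 * window:
        return False  # not enough history to compare old against new
    older = mean(samples[:window])
    recent = mean(samples[-window:])
    return recent - older > tolerance


# Hypothetical temperature readings gathered over time
readings = [40, 41, 40, 42, 47, 49, 50]
print(rising_trend(readings, window=3, tolerance=5.0))
```

Averaging a few readings at each end keeps a single unusual sample from being reported as a trend.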
The memory inside of these devices is used by the system to be able to operate, so we want to be sure that we have plenty of memory available for all of the processes. If you see memory usage climbing higher and higher, leaving less memory available for the rest of the system, you may be getting into a situation where that device is no longer able to operate. Most of these infrastructure components are connected to the network in some way.
And usually these devices are passing or forwarding traffic from one network to another. It would be useful to know what the utilization is of the traffic going between those interfaces or onto those networks. So having some way to monitor the bandwidth can provide you with some important metrics. There are a number of different ways to gather this information. A very simple way might be with Simple Network Management Protocol, or SNMP.
But there are other more advanced methods such as sFlow, NetFlow, and IPFIX that can tell you more about the utilization of a network. You could even install software agents or capture packets yourself to see a detailed analysis of what’s happening on the network. If you see that the utilization is getting above 85% or 90%, this might be an early warning that there’s very little capacity left on this link and that a large amount of traffic is traversing it.
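To make the utilization percentage concrete, here is a minimal Python sketch that turns two byte-counter samples into a percentage of link capacity. It assumes 32-bit counters that wrap at most once between samples; the function names and the default 85% threshold are illustrative choices, not part of any specific tool.

```python
COUNTER32_MODULUS = 2 ** 32  # assumption: the device reports 32-bit counters


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Difference between two counter samples, allowing for one wraparound."""
    return (current - previous) % modulus


def utilization_percent(prev_octets, curr_octets, interval_seconds, link_bps):
    """Percentage of link capacity used between two octet-counter samples."""
    bits_transferred = counter_delta(prev_octets, curr_octets) * 8
    return 100.0 * bits_transferred / (interval_seconds * link_bps)


def is_early_warning(percent, threshold=85.0):
    """True when utilization reaches the (assumed) warning threshold."""
    return percent >= threshold


# Hypothetical samples taken 60 seconds apart on a 100 Mbit/s link
percent = utilization_percent(0, 675_000_000, 60, 100_000_000)
print(percent, is_early_warning(percent))
```

Working from the difference between two samples matters because the counters only ever count upward; a single reading says nothing about the current rate.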
It’s also useful to measure the latency on a network. The latency is the delay between a request and a response. Being able to measure that gap on the network allows you to understand what overhead may be in place when you’re performing these network requests. There will always be some type of latency. If you’re on a local network, this latency is very small. But if you’re communicating over a non-terrestrial or satellite link, there could be very long latency between each request and each response.
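As a small illustration of latency as the gap between a request and its response, here is a Python sketch that summarizes a list of timestamp pairs. The timestamps are made-up values, and the names `latencies_ms` and `summarize` are hypothetical.

```python
from statistics import mean


def latencies_ms(exchanges):
    """Delay in milliseconds for each (request_time, response_time) pair in seconds."""
    return [(response - request) * 1000.0 for request, response in exchanges]


def summarize(exchanges):
    """Minimum, average, and maximum latency across all exchanges."""
    values = latencies_ms(exchanges)
    return {"min_ms": min(values), "avg_ms": mean(values), "max_ms": max(values)}


# Made-up timestamps: two quick local responses and one very slow response
exchanges = [(0.000, 0.002), (1.000, 1.004), (2.000, 2.600)]
print(summarize(exchanges))
```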
If you feel that your applications aren’t performing well, you could take a packet capture and measure the response time between every request and its response. This will help you understand how much of the delay is network response time versus overall application response time. Capturing this application traffic with a packet analyzer can provide you with detailed analysis of how the application is performing down to the microsecond level.
This can be the difference between understanding if the problem may be associated with the internals of a database server or whether the issue may be associated with a wide area network. We deal with a lot of real-time media on our networks. Much of this could be Voice over IP conversations, where we are communicating with someone else in real time. And we want to be sure that this traffic is able to go back and forth between those phones without any type of delay.
There might also be video communication that needs to occur, and any delays with that video will create problems with the overall communication. One of the issues with this live media is that, if you lose a packet somewhere along the way, there’s no way to recover that information. You can’t request a retransmission of that information because that moment in time is already gone. You can only hope that you’re missing only a small amount of information that doesn’t disrupt the overall conversation.
From a network management perspective, we can measure the delay between each of these frames and maintain an ongoing calculation of how much that delay varies, a value known as jitter. If you have an excessive amount of variation in the delay between frames, or excessive jitter, then you may find that it’s difficult to keep track of or understand what’s happening with your real-time telephone calls. The way this should work is that we have traffic going across the network over time, and hopefully those frames are arriving at very regular intervals.
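One simple way to picture that calculation is to look at the gaps between frame arrivals and see how much those gaps change from one frame to the next. The Python sketch below uses that simplified definition, which is an assumption for illustration and not the exact formula used by any particular monitoring tool. The arrival times are made-up values in milliseconds.

```python
def inter_arrival_gaps(arrival_times):
    """Time between each pair of consecutive frame arrivals."""
    return [later - earlier for earlier, later in zip(arrival_times, arrival_times[1:])]


def mean_jitter(arrival_times):
    """Average absolute change between consecutive inter-arrival gaps."""
    gaps = inter_arrival_gaps(arrival_times)
    changes = [abs(b - a) for a, b in zip(gaps, gaps[1:])]
    return sum(changes) / len(changes) if changes else 0.0


steady = [0, 20, 40, 60, 80]     # frames arriving every 20 ms
uneven = [0, 20, 45, 60, 110]    # same frame count, irregular arrivals
print(mean_jitter(steady), mean_jitter(uneven))
```

Frames that arrive at perfectly regular intervals produce a jitter of zero here, and the value grows as the spacing becomes more irregular.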
You can see that the spacing between these frames is not exactly the same, but at least there’s no excessive jitter between any of them. If you do have delays on the network, you may find that the frames do have sections of excessive jitter, and this could make these real-time applications very difficult to use. It’s very important that you’re able to monitor the individual network interfaces on these devices often because it’s your first line of defense and can give you more insight into potential problems that may occur later.
For example, a small number of errors occurring on an interface could be a precursor to a much larger number of errors that could happen over time. Being able to monitor these statistics and react to these problems could potentially resolve problems before they take part of the network down. Much of this information can be viewed directly in your operating system. You can see the status of the interface that you’re using.
But often these devices are located on a different part of the network or even in a different location. So you may need some automated way to be able to monitor those interfaces. And we can do that by using the Simple Network Management Protocol, or SNMP. SNMP allows you to query the statistics on that device. The device will respond back with an answer of what those statistics are. And you can continue to do that over time to be able to build a trend of what’s happening on that system.
Many of the devices you’re using will have a very standard database they use for these statistics. In SNMP, we call this a Management Information Base, or a MIB. And one type of standard MIB is MIB-II. If you’re connecting to a device, you should be able to query it with the standard MIB-II queries and get responses back in that MIB-II format. Many devices will also include their own proprietary MIB as well, so that if it is a printer or a firewall or a switch, you will have specific statistics that are for printers, for firewalls, or for switches.
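To show what repeatedly querying MIB-II statistics might look like, here is a Python sketch built around a few standard MIB-II interface object identifiers. The `snmp_get` argument is a hypothetical stand-in for whatever SNMP library or tool you actually use, and the canned values below exist only so the example runs without a real device.

```python
import time

# A few standard MIB-II interface objects (the interface index is appended)
MIB2_OIDS = {
    "ifOperStatus": "1.3.6.1.2.1.2.2.1.8",   # 1 = up, 2 = down
    "ifInOctets": "1.3.6.1.2.1.2.2.1.10",
    "ifInErrors": "1.3.6.1.2.1.2.2.1.14",
    "ifOutOctets": "1.3.6.1.2.1.2.2.1.16",
}


def instance_oid(name, if_index):
    """Full OID for one statistic on one interface."""
    return f"{MIB2_OIDS[name]}.{if_index}"


def poll_interface(snmp_get, if_index, clock=time.time):
    """Query each statistic once and return a timestamped sample.

    snmp_get is a hypothetical callable that takes an OID string and
    returns the value reported by the device.
    """
    sample = {"timestamp": clock()}
    for name in MIB2_OIDS:
        sample[name] = snmp_get(instance_oid(name, if_index))
    return sample


# Canned responses standing in for a real device (illustrative values only)
canned = {
    instance_oid("ifOperStatus", 3): 1,
    instance_oid("ifInOctets", 3): 1_250_000,
    instance_oid("ifInErrors", 3): 0,
    instance_oid("ifOutOctets", 3): 980_000,
}
history = [poll_interface(canned.__getitem__, 3) for _ in range(2)]
print(history[-1]["ifOperStatus"])
```

Keeping every sample in a list like `history` is what lets you build a trend later instead of only seeing the most recent value.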
One basic metric that’s important to monitor is whether an interface is active or whether it’s not active. Being able to monitor this on a switch can give you insight into whether a link on the other side of the switch may have failed and therefore created an outage on the interface on your local switch. We very commonly would alert or alarm on that link status so that we can be informed immediately that that link has failed. We also want to monitor these interfaces over time to understand if any errors may be occurring.
We can look at Cyclic Redundancy Check errors, runts, giants, and other Ethernet errors as well. And of course, we want to check every interface to understand what the utilization might be on that connection and be able to understand if we happen to have any spikes or any drops in the normal utilization percentages. If we’re concerned that an interface is not able to send or receive the proper amount of traffic, we could run some bandwidth tests to be sure that that particular interface is working as expected.
Normally, these infrastructure devices are transferring data from one interface to another or forwarding traffic through a switch. And we want to be sure, if an application is being used on the network, that these devices are able to properly forward that traffic. If there is a problem with the data or the device itself becomes overutilized, we may find that we have discards or packet drops. So monitoring those metrics can tell you if a particular interface or a particular component within that device is having problems.
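Because these statistics are counters that only grow, what matters is whether they increased between two checks. The Python sketch below compares two samples and reports only the counters that went up. The counter names are hypothetical labels for the example and do not come from any specific device.

```python
# Hypothetical counter names for the statistics discussed above
WATCHED_COUNTERS = ("crc_errors", "runts", "giants", "discards")


def counter_increases(previous, current):
    """Return only the watched counters that grew between two samples."""
    return {
        name: current[name] - previous[name]
        for name in WATCHED_COUNTERS
        if current[name] > previous[name]
    }


# Made-up samples from two consecutive checks of one interface
earlier = {"crc_errors": 4, "runts": 0, "giants": 0, "discards": 120}
later = {"crc_errors": 9, "runts": 0, "giants": 0, "discards": 120}
print(counter_increases(earlier, later))
```

An empty result means nothing changed since the last check, even if the counters themselves hold large values left over from an old problem.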
We might also run into situations where an interface is not able to communicate properly on the network or you find that an interface is constantly resetting itself. This would be a situation where packets may be queued inside of that particular interface, but none of those packets are being properly forwarded. You can think of this as the system only working halfway. It’s connected to the network, but for some reason, it’s not able to properly forward traffic.
In those situations, we would do the typical IT response, which is to power off that interface, turn it back on, and hope that it has cleared itself out and is now able to work properly. And of course, we want to be sure that we’ve configured these interfaces for the proper speed and the proper duplex. We can look at the statistics of both speed and duplex on one end of the link,
compare that to the settings on the other end of the link, and make sure that both of those match and are working as expected. What you may find with certain switches in certain situations is that one end of the cable is configured for full duplex and the other end of the cable auto-negotiates itself to half duplex. When that occurs, you’ll see a number of different symptoms, such as lower throughput and late collisions, which may indicate that the configuration of speed and duplex on that link does not match on both sides.
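As a final illustration, comparing the two ends of a link can be reduced to a simple check once you have the settings from each side. This Python sketch assumes you have already collected speed and duplex from both devices. The `LinkSettings` type and the `mismatches` function are hypothetical names for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSettings:
    """Operating settings reported by one end of a link."""
    speed_mbps: int
    duplex: str  # "full" or "half"


def mismatches(local, remote):
    """List which settings differ between the two ends of a link."""
    problems = []
    if local.speed_mbps != remote.speed_mbps:
        problems.append("speed")
    if local.duplex != remote.duplex:
        problems.append("duplex")
    return problems


# One end fixed at full duplex, the other negotiated down to half duplex
switch_port = LinkSettings(speed_mbps=100, duplex="full")
server_nic = LinkSettings(speed_mbps=100, duplex="half")
print(mismatches(switch_port, server_nic))
```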