The fundamentals of troubleshooting remain consistent across the network. In this video, you’ll learn how to break problems into smaller pieces and how to methodically solve complex issues.
As a network administrator, you’ll be doing a lot of troubleshooting. In this video, we’ll step through this network troubleshooting flowchart and see some of the best practices used in the industry to find and fix the problems on your network. Before taking any action or changing anything on your network, the first thing you should do is gather as much information as possible. This is the first phase of the process, where we identify what the problem really is.
It will be important to gather as much information as possible. And it would be very useful if we had a way to duplicate the problem. Sometimes, we’re trying to decipher what the problem might be from information that was left inside of a help desk ticket. So it’s very useful to reach out to the user and talk to them about exactly what they might be seeing. There may be a number of symptoms that they might be experiencing.
There might be error messages on the screen. There might be slowdowns on the network. And all of these details can help you narrow down where this problem might be occurring. Sometimes, a problem begins because of something that we did. So it’s useful to know if anybody was working inside a wiring closet or if any changes were made overnight.
Although some of these problems may have a very wide scope and seem almost insurmountable, every issue can be broken down into smaller pieces. And then you can start troubleshooting one piece at a time. Given all of the alternatives, the simplest explanation is usually the right one. So as we start looking over the notes that we made about this particular problem, we can start arranging them in a way that the most obvious or most common things would be at the top of our list.
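As a minimal sketch of arranging that list, here is one way it might look in Python. The `Theory` class, the 1-to-5 likelihood scale, and the sample causes are all assumptions made up for illustration; they are not part of any particular help desk tool.

```python
from dataclasses import dataclass


@dataclass
class Theory:
    description: str
    likelihood: int  # hypothetical rough estimate: 5 = most common, 1 = least common


def rank_theories(theories):
    """Return every theory, ordered from most to least likely."""
    return sorted(theories, key=lambda t: t.likelihood, reverse=True)


# Illustrative notes gathered while identifying the problem.
theories = [
    Theory("Misconfigured VLAN on the switch port", 2),
    Theory("Bad patch cable", 5),
    Theory("Change made in the wiring closet overnight", 4),
]

ranked = rank_theories(theories)
for theory in ranked:
    print(theory.likelihood, theory.description)
```

Nothing is dropped from the list here. The less common causes simply move toward the bottom.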
But of course, this problem might be caused by something that isn’t obvious. And we still need to make a note of what those less obvious causes might be. This gives us a full perspective of where this problem might be coming from, even if it turns out to be something that isn’t very common. Some people like to start at the bottom of the OSI model and make sure that their cable is working properly, that signal is getting through the switch, and then they can work their way up the OSI stack.
Other folks prefer to start with the application and work their way down. Both of these are valid, as long as we’re starting at one end, breaking the problem into smaller pieces, and discarding anything that may not be associated with our particular issue. Now that we have a list of things that could be causing this problem, it’s useful to test our theory and see if it really is causing the problem.
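The bottom-up and top-down approaches can be sketched as the same loop run in opposite directions. This is only an illustration: the `first_failing_layer` function is hypothetical, and the checks are stand-in stubs rather than real cable or application tests.

```python
OSI_LAYERS = [
    "Physical", "Data Link", "Network", "Transport",
    "Session", "Presentation", "Application",
]


def first_failing_layer(checks, bottom_up=True):
    """Walk the OSI layers from one end and return the first layer whose check fails.

    checks maps a layer name to a function that returns True when that
    layer looks healthy. Layers without a check are skipped.
    Returns None if every check passes.
    """
    order = OSI_LAYERS if bottom_up else list(reversed(OSI_LAYERS))
    for layer in order:
        check = checks.get(layer)
        if check is None:
            continue
        if not check():
            return layer
    return None


# Stub results for illustration: the cable is fine, but routing and the app are not.
checks = {
    "Physical": lambda: True,
    "Network": lambda: False,
    "Application": lambda: False,
}

print(first_failing_layer(checks, bottom_up=True))
print(first_failing_layer(checks, bottom_up=False))
```

Either direction works, as long as we start at one end and keep going until a check fails.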
If we think the issue is related to a bad cable, we can swap out the cable and see if the problem is still there. If we’ve replaced the cable and this did not resolve our problem, we need to go back to our list and see what the next step might be. You might go through all of your ideas about what this problem might be and still not be able to resolve it.
In those cases, you may want to call an expert who can bring in some additional knowledge to help you troubleshoot. Sometimes, resolving the issue may take a number of steps. You may have to install additional software or it may require a very extensive infrastructure change. In those cases, we may need to build a plan that shows how we would resolve this problem with the minimum amount of impact to our uptime and availability.
If the system is already down, you may be able to replace the cable or perform this troubleshooting task without causing much of an outage. But most of these types of problems have to be planned and resolved during non-production hours. As with most change control, we need not only a primary plan for resolving this issue, but backup plans as well. If we can’t resolve it with our initial plan, then we need to have some other tasks that we can perform while we have this downtime.
We’ve now got a plan A and perhaps a plan B and a plan C. We’ve got a change control window. And now it’s time to try to resolve this problem. We can replace the cable, upgrade software, or do anything else required during our change control window. If we don’t have the resources we need to be able to resolve this in a timely manner, we may want to contact a third party and see if we can use some additional help in resolving this issue.
Of course, replacing the cable or upgrading the software doesn’t guarantee that you fixed the problem. You need to circle back to the customer. Have them try the fix on their system. And have them determine if you’ve really resolved the problem. It may be that this problem could have been avoided to begin with. So we might want to update our policies and procedures to implement some type of preventative measure.
Once everything is back up and running and the customer has confirmed that everything is working as expected, we need to document exactly what we did, so that the next time this occurs, we have a knowledge base we can consult to help us resolve it faster. Most help desk software includes a knowledge base or some type of searchable database.
And we can put everything we know into that database for next time. These are the steps that you’ll follow every time there’s a problem. Some of the steps you’ll spend more time on than others. But the entire process is one that’s very common to troubleshooting networks, applications, or anything in your environment. We’ll start with identifying the problem and gathering information that can help us understand more about what’s happening with this issue.
Then we can come up with an idea of what the possible causes of this particular issue might be. Then we can test that theory to see if we can confirm that what we’re seeing is being caused by what’s in our list of theories. We can try those theories one at a time until we find the one that resolves this particular problem. Then, we can create a plan that would resolve this issue and identify any problems we might create during the resolution process. Once we get a change control window, we can use the plan that we’ve created to implement the change on our network.
And then we can confirm that this change has really solved the problem. This often involves reaching out to the customer and having them test the issue to see if we really were able to fix the problem. And lastly, we want to be sure we document everything, so that we have a knowledge base the next time this occurs. You’ll find that following this flowchart will allow you to troubleshoot issues with your network, your applications, or anything in your data center.
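As a rough sketch of the test, verify, and document steps from this flowchart, the loop below tries theories one at a time. Everything in it is an assumption for illustration: the function name, the two callbacks, and the plain list standing in for a knowledge base are not taken from any real help desk product.

```python
def work_through_theories(theories, test_theory, customer_confirms, knowledge_base):
    """Try each theory in order until one is confirmed and verified.

    test_theory and customer_confirms are hypothetical callbacks that
    return True or False. Returns the confirmed theory, or None when the
    list is exhausted and it is time to bring in an expert.
    """
    ruled_out = []
    for theory in theories:
        if test_theory(theory) and customer_confirms(theory):
            # Document the cause and what was ruled out along the way.
            knowledge_base.append({"cause": theory, "ruled_out": list(ruled_out)})
            return theory
        ruled_out.append(theory)
    return None


# Illustrative run with stubbed test results.
kb = []
result = work_through_theories(
    ["Bad patch cable", "Duplex mismatch", "Failed switch port"],
    test_theory=lambda theory: theory == "Duplex mismatch",
    customer_confirms=lambda theory: True,
    knowledge_base=kb,
)
print(result)
print(kb)
```

A return value of None here corresponds to the point in the process where we have gone through all of our ideas and need additional help.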