Network professionals solve problems every day. In this video, you’ll learn how a troubleshooting methodology can simplify complex issues and make it easier to identify and solve big problems.
When you work with networks, there will be problems to resolve and issues to troubleshoot. In this video, we’ll look at a flow chart that takes you through the process of troubleshooting different aspects of your network. This is a high-level overview, and the idea is to apply these concepts to any problem that you might run into in the field.
We start with identifying the problem, understanding where the issue really is associated with this particular problem. And then we need to establish theories, test those theories, and evaluate the results of those theories. We can then establish a plan of action, implement the plan, verify full system functionality, and document our findings.
For the first phase, where we identify what the problem really is, the cause might be very obvious. You might walk up and find a cable on the floor that’s in two pieces. That makes it easy to determine where the problem happens to be. But often, the issue is not quite so cut and dried, and you might want to confirm that you’ve really found the problem by attempting to duplicate the issue.
One of the best places to go for more information is the users themselves. What are they experiencing? What have they seen associated with this problem? Those experiences can be combined with the statistics and metrics that you’re gathering from your routers and your switches to try to build a better view of what the problem actually is.
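To make that concrete, here is a minimal Python sketch of pulling a few interface error counters from a switch with net-snmp’s snmpget command, so user reports can be lined up against hard numbers. It assumes the net-snmp tools are installed and that the device allows SNMPv2c reads; the management address, community string, and interface index shown are hypothetical placeholders, not anything from this video.

```python
# Minimal sketch: read interface error counters from a switch via snmpget.
# Assumes net-snmp CLI tools and SNMPv2c read access; values below are placeholders.
import subprocess

DEVICE = "192.0.2.10"     # hypothetical switch management address
COMMUNITY = "public"      # assumed read-only community string
IF_INDEX = 3              # interface index to inspect

OIDS = {
    "ifInErrors":   f"1.3.6.1.2.1.2.2.1.14.{IF_INDEX}",
    "ifOutErrors":  f"1.3.6.1.2.1.2.2.1.20.{IF_INDEX}",
    "ifInDiscards": f"1.3.6.1.2.1.2.2.1.13.{IF_INDEX}",
}

for name, oid in OIDS.items():
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", DEVICE, oid],
        capture_output=True, text=True,
    )
    # Print the counter value, or the error message if the query failed
    print(f"{name}: {result.stdout.strip() or result.stderr.strip()}")
```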
We often think of these problems as having a single symptom that we need to solve, but in reality, there might be multiple symptoms working together that are causing this particular issue. One of the questions we should often ask ourselves is, what has changed from this configuration today to the one we were using yesterday? Was someone in the wiring closet moving cables around? Was a particular system powered off? Did we make changes to a configuration in the meantime?
All of these questions need to be considered to help understand what the problem might actually be. Many network professionals will build a lab where they can attempt to duplicate the problem. If they can duplicate the problem, then it’s easier to find the root cause and ultimately resolve the issue. And if the scope of a problem is too large, you may want to break it into smaller pieces and take one unit at a time or one component at a time to try to determine where the problem is really happening.
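One simple way to answer the “what has changed?” question is to diff saved copies of a device configuration. Here’s a minimal sketch using Python’s difflib; the file names are hypothetical, and the point is only to make configuration drift visible.

```python
# Minimal sketch: compare yesterday's and today's saved device configurations.
# File names are hypothetical placeholders.
import difflib

with open("router1-config-yesterday.txt") as f:
    old_cfg = f.readlines()
with open("router1-config-today.txt") as f:
    new_cfg = f.readlines()

diff = difflib.unified_diff(
    old_cfg, new_cfg,
    fromfile="yesterday", tofile="today",
)
print("".join(diff) or "No configuration changes detected.")
```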
Now that we’ve gathered all of this information about the problem itself, it’s time to start creating theories of what could possibly be causing this issue. We usually want to start with obvious problems and problems that can be solved very quickly. Is the issue related to a cable? We can swap a cable and see if that changes the issue. Often, though, the problem is more complex, and we need to dive into the specifics to really understand where the probable cause might be.
We might want to start with a top-down approach from the perspective of the OSI model. So we would start with the application and work our way down into the network. Or maybe this is a new network implementation and we would prefer working from the bottom up. You might want to verify the network is working properly, and then move towards the application.
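As a rough illustration of the bottom-up approach, here is a minimal Python sketch that checks the gateway first, then name resolution, then the application port. The host names and addresses are hypothetical, and each step assumes the one below it already passed.

```python
# Minimal sketch of a bottom-up check: gateway, then DNS, then the service port.
# Hosts and addresses are hypothetical placeholders.
import socket
import subprocess

def ping(host: str) -> bool:
    """Layer 3: send one ICMP echo (Linux/macOS 'ping -c 1')."""
    return subprocess.run(["ping", "-c", "1", host],
                          capture_output=True).returncode == 0

def resolves(name: str) -> bool:
    """Name resolution: can we turn the name into an address?"""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

def port_open(host: str, port: int) -> bool:
    """Layer 4 and above: can we complete a TCP handshake to the service port?"""
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

print("gateway reachable:", ping("192.0.2.1"))
print("DNS resolves:     ", resolves("app.example.com"))
print("app port open:    ", port_open("app.example.com", 443))
```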
And if you can eliminate different aspects of the environment, you might simplify the troubleshooting process. If the issue appears in one operating system and it also appears in a different operating system, then the problem obviously is not directly associated with the operating system.
Now that we have a theory of what we think the probable cause might be, let’s put together a series of steps where we can test this theory to see if it really might be the cause of the issue.
For example, we may have a theory that the problem is associated with a configuration. So in our lab, we might change that configuration and evaluate the change. Did this change have a positive effect and effectively solve the problem, or does the problem still exist? If making that configuration change did not solve the problem, then it’s not fixed yet. We can go back and establish a new theory of what we think the issue might be.
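To keep that evaluation consistent, it can help to wrap the check in something repeatable, so every theory gets tested the same way before and after the change. Here’s a minimal sketch; the lab server name and port are hypothetical, and the TCP connection test simply stands in for whatever symptom you are trying to reproduce.

```python
# Minimal sketch: a repeatable check for whether the symptom is still present.
# The lab host and port are hypothetical placeholders.
import socket

def symptom_present(host: str = "lab-server.example.com", port: int = 8080) -> bool:
    """Return True if the failure we are chasing still occurs."""
    try:
        with socket.create_connection((host, port), timeout=3):
            return False   # connection works, symptom is gone
    except OSError:
        return True        # still failing

print("Before change, symptom present:", symptom_present())
# ... apply the candidate configuration change in the lab ...
print("After change, symptom present: ", symptom_present())
```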
Let’s say that we have finally identified the configuration change that is going to solve this issue for everyone in our production environment. Now we have to come up with a plan of how we’re going to implement that change into the existing production systems. With some organizations and some types of changes, you can make some updates during the day to help resolve that issue. But many times, the production network cannot be touched during the day and you have to schedule change control to make any significant changes.
And, of course, if we’re going to be making any changes to our production environment, we need to be prepared for anything. So we want to have a plan that we’re going to use to put this change into effect. And if we run into problems, we’ll need a plan B. And if that runs into problems, we might even need a plan C. And, of course, if you’re following the normal change control process, there’s probably also a rollback process just in case you need to go back to the way things were before you even started.
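One small habit that makes the rollback plan easier is saving a timestamped copy of the current configuration before touching anything, so going back is just restoring the saved file. Here’s a minimal sketch; the configuration path is a hypothetical placeholder.

```python
# Minimal sketch: keep a timestamped backup of a configuration file before a change.
# The path is a hypothetical placeholder.
import shutil
from datetime import datetime

CONFIG = "/etc/network/production-router.cfg"
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
backup = f"{CONFIG}.{stamp}.bak"

shutil.copy2(CONFIG, backup)
print(f"Rollback copy saved to {backup}")
```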
So now we’ve identified what we believe will fix the issue, we’ve gone through the change control process, and we’ve gotten some time during the day that we can make this particular change. Now it’s time to implement the fix into our production environment, usually during the change control window that was assigned to us. And in some companies, the folks that implement the change are different than the folks that determine what change needs to be made.
So you might have a troubleshooting team that determines what needs to be changed, and that change is then handed off to the operations team to make the actual change. Although we’ve tested this in our lab, made the appropriate changes, and believe those changes will resolve this problem, it’s never confirmed until we can get users to verify that the problem they saw before is really resolved.
We need to verify full system functionality. And often, this involves the end user or the folks that identified the problem to begin with. This might also be a good time to sit down with those users and talk about ways that we could prevent this problem from occurring in the future. They might have ideas of things that might help them work better, and you might have ideas of how you could prevent this from a technology perspective.
And, of course, nothing is finished until it’s documented. We need to be sure that we are not only documenting the process we followed, but we need to identify what change we made that resolved this issue. This way, if we happen to run into this issue a year from now, we can refer back to our documentation to see exactly what we did to resolve this problem.
In many cases, there’s a help desk database or a knowledge base where we can store this information and search through those details later. And hopefully, a year from now, we’ll be able to easily find and reference the information that we worked so hard on today.
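If you don’t have a formal knowledge base, even a simple structured log captures the essentials. Here’s a minimal sketch that appends a record of the symptom, root cause, and fix to a JSON-lines file so it can be searched later; the field values are hypothetical examples, and a real help desk system would hold the same kind of information.

```python
# Minimal sketch: append a searchable record of the problem and its fix.
# The field values are hypothetical examples.
import json
from datetime import date

record = {
    "date": str(date.today()),
    "symptom": "Users in building B lose access to the ERP application",
    "root_cause": "Incorrect VLAN assignment on the distribution switch uplink",
    "fix": "Reassigned the uplink to the correct VLAN during the change window",
    "verified_by": "Affected users confirmed access was restored",
}

with open("troubleshooting-log.jsonl", "a") as log:
    log.write(json.dumps(record) + "\n")
```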
So that’s the troubleshooting process. We need to identify the problem, which involves gathering information, symptoms, talking to users, and determining if anything in our network may have changed. We then need to establish a theory on what we think is really causing this problem. Of course, the only way to tell if our theory is correct is to put it to the test. So we might go to the lab, make those configuration changes, and see if they had any effect on the problem that we’re seeing.
If that theory does not fix the problem, we can go back to establishing a theory and go through the process again. Once we’ve identified the fix in the lab, we now need to put together a plan of how we’re going to implement this in our production network. We would then get a change control window, implement this into our production network, and then have our users test the fix to see if it really resolved the problem.
And, of course, we need to document this entire process so that we or someone else can use what we did today if this problem happens to occur tomorrow. That’s our troubleshooting process. Hopefully, that’s given you some ideas of how you can make your troubleshooting process go as smoothly as possible.