A significant part of a technician’s life is spent troubleshooting problems. In this video, you’ll learn about the process of analyzing, planning, and implementing solutions to technical issues.
If it’s your job to solve a problem on your network, with your computer, or in any other part of your IT organization, there are a series of steps you can follow to help guide you toward a resolution of that problem. And in this video, we’re going to step through this entire flow chart and give you an idea of some of the best practices that you can put in place to solve these problems.
You can’t solve a problem unless you know what’s going on. So the first step is going to be to identify the problem. This is where you’re going to be doing a lot of information gathering– really documenting and determining exactly what the problem happens to be. After you’ve gathered this intel, you can start thinking about what could possibly have caused this particular problem. And you would come up with a few ideas of things you can do to solve the issue.
You won’t know if your theories are correct unless you apply them. You need to test those theories to see if they’re really the cause of the issue. And if your theory doesn’t work, you go back to the drawing board, come up with another theory, and test that one.
Now that you’ve tested the theory and you’ve come up with what you believe the fix might be, it’s time to apply that fix. So we need to come up with a plan that would allow us to implement that fix and identify any potential effects of that fix.
Now that we’ve created a plan, we can implement that change and see if it resolves the actual problem. And, of course, we need to put together any plans so that this problem doesn’t occur again. Now that we’ve gone through this entire process, it’s useful after the fact to document this. That way, if it ever occurs again, we can go back to our notes, determine what resolved it last time, and apply that fix without going through this entire process again.
The first step of identifying the problem is often the most critical. This is where you are gathering as much information as possible about what this problem really is. And this is where we’re going to be asking a lot of questions. If somebody says the network isn’t working, you need to find out why they believe that’s the case. So you’ll be asking a lot of questions. Why do you think the network isn’t working? Are you seeing an error message? What does that error message say? That way, you can gather as many details as possible that you can then use later on.
There also may be multiple symptoms involved– a printer isn’t working, someone can’t log in, and network communication to a particular website isn’t working. All three of those things may be the exact same problem. So when you’re getting a lot of calls and a lot of tickets, it’s useful to piece together how these multiple symptoms may be related to each other. And ultimately, you may want to ask your users. They’re the ones who are experiencing the problem, so you want to get their perspective on exactly what they’re seeing on their desktop.
This is also a good opportunity to find out if someone may have changed something. When we were troubleshooting network problems, we used to ask if someone had been in the wiring closet at that particular time. Did somebody modify or patch an operating system on a server? Is somebody doing maintenance in the data center? Sometimes people can make a change without realizing that it might affect some other part of the network or other applications. So you want to find out whether anyone has been making changes and, if they have, what those last changes might have been.
Now that we have all of these details about where the problem might be, it’s time to start piecing together some ideas of what’s causing it. We need to come up with a few competing theories and determine which one of these we would like to pursue first. Generally, we start with the easiest ones. Occam’s razor tells us to favor the simplest explanation, so we test the simplest theories from the very beginning. That’s why we ask questions like, is it powered on? That’s a pretty simple check, and it’s one that you can perform and hopefully resolve the issue very quickly.
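As an illustration only (not something from the video), here’s a minimal Python sketch of what “simplest theories first” can look like for a network complaint: run the cheap checks before the involved ones, and stop at the first one that fails. The interface name and addresses are hypothetical.

```python
# A minimal sketch (not from the video) of testing the simplest theories
# first: inexpensive checks run before involved ones, and the first
# failure points to where troubleshooting should start.
# The interface name and IP addresses below are hypothetical.
import subprocess

def link_is_up(interface="eth0"):
    # On Linux, this sysfs file reports 1 when the physical link is active.
    try:
        with open(f"/sys/class/net/{interface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        return False

def can_ping(host):
    # True if a single ICMP echo request gets a reply within two seconds.
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            capture_output=True)
    return result.returncode == 0

checks = [
    ("Is the link up (cable plugged in)?", lambda: link_is_up("eth0")),
    ("Can we reach the default gateway?",  lambda: can_ping("192.168.1.1")),
    ("Can we reach an outside address?",   lambda: can_ping("8.8.8.8")),
]

for description, check in checks:
    passed = check()
    print(f"{description} {'yes' if passed else 'NO -- start here'}")
    if not passed:
        break  # the simplest failing check becomes our first theory
```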
You also need to consider every reasonable possibility. Certainly we don’t worry about aliens coming down and destroying our network, but there may have been changes to a particular switch or router that now cause you to question how the operating system in those devices might be performing. Maybe it’s the switch itself causing our problem, or maybe it isn’t. We need to come up with a number of different possibilities to choose from, so we put together every plausible scenario and then start determining which of these possibilities might really apply here. We can throw out the ones that are completely off the beaten path, but we may also want to start with some easy theories and work our way up to some of the much more complex ones.
If you’re in a very large organization, you may not have the luxury of changing things in a production environment just to test some of your theories. So you may take this into a lab and go through the process of testing it there. Apply a different patch. Remove a patch that was applied last night. Does the problem still exist, or did the problem go away? We can go through each one of our theories to determine where this problem might be occurring. And if we aren’t quite certain that we’re finding what we need, we might want to go back and try a different theory, or perhaps bring in an expert who knows a lot more about the subject matter than we do. This is one of the advantages of information technology. It’s so broad that you can find experts who are very specifically trained in some of the things that we’re trying to solve.
In some environments, the troubleshooting team doesn’t have access to the production servers, so you have to write up some documentation and hand it off to your operations team, because they’re the ones who are actually going to be resolving the issue. And most of the time, this is done during non-production hours. You have to schedule a maintenance window and go through change control so that everyone knows exactly what’s going to be done and it will have the smallest possible impact on your user community.
Even when you’re going through the process of resolving an issue, you still need to have some alternatives. If you apply a patch and something bad happens in production that you weren’t expecting, you need a plan B. Do we roll back to a different version? Have we already performed a backup prior to making this change so that we can simply revert to the backup configuration? Regardless of what that alternative might be, we need to have that plan in place so that when we make the change and something goes wrong, we can easily go back to the original configuration.
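Here’s a minimal sketch, purely as an illustration, of what that plan B can look like when the change is an edit to a single configuration file. The file path and the change itself are hypothetical placeholders; the point is simply back up first, apply the change, and restore the backup if anything goes wrong.

```python
# A minimal sketch, assuming the change is an edit to one configuration
# file. The path and the change are hypothetical placeholders. Plan B is
# simply the backup copy made before touching anything.
import shutil

CONFIG = "/etc/myapp/app.conf"      # hypothetical configuration file
BACKUP = CONFIG + ".bak"

def apply_change(path):
    # Placeholder for the real change: a patch, a new setting, and so on.
    with open(path, "a") as f:
        f.write("\n# setting applied during the maintenance window\n")

shutil.copy2(CONFIG, BACKUP)        # take the backup before the change
try:
    apply_change(CONFIG)
except Exception:
    shutil.copy2(BACKUP, CONFIG)    # something went wrong: revert to the backup
    raise
```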
Now that we’ve created that plan, we can actually implement it. We can give it to our operations team. They can go in after hours, apply the correction to the server or to the network, and try to resolve that issue. If this is something on a very large scale, or it involves a very specialized piece of equipment, you may need a third party. You may be calling in experts to come in and assist with the actual fix of the problem.
Just because you implemented a fix doesn’t mean you actually resolved the problem. So you want to be sure that you have some tests in place– some things that you can go through after implementing the fix that will show whether the problem was resolved or not. Sometimes these are relatively simple, but sometimes you may need help from your user community, perhaps with many people hitting the same application at one time to see whether those problems occur again.
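As one possible example (not from the video), a post-change verification test for a web application could be as simple as the sketch below. The URL and the expected page text are hypothetical.

```python
# A minimal sketch of a post-fix verification test for a web application.
# The URL and the expected text are hypothetical.
import urllib.request

def application_is_healthy(url, expected_text):
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected_text in body
    except OSError:
        return False

if application_is_healthy("http://app.example.com/login", "Sign in"):
    print("Fix verified: the application responds as expected")
else:
    print("Problem not resolved: revisit the theory or roll back the change")
```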
And during the implementation of the fix, we may also be able to put some preventive maintenance in place. Maybe now we create a script that will check for that particular problem in the future, as in the sketch below. Or, if it’s hardware related, maybe we keep additional hardware on standby just in case this problem ever occurs again.
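For instance, if the original problem had been a disk filling up, that kind of preventive script might look like this minimal sketch. The threshold and path are hypothetical, and you’d schedule it with something like cron or Task Scheduler so it warns before the same failure happens again.

```python
# A minimal sketch of a preventive check, assuming the original problem
# was a disk filling up. The threshold and path are hypothetical.
import shutil

THRESHOLD_PERCENT = 90
PATH = "/"

usage = shutil.disk_usage(PATH)
percent_used = usage.used / usage.total * 100

if percent_used >= THRESHOLD_PERCENT:
    print(f"WARNING: {PATH} is {percent_used:.0f}% full -- act before it fails again")
else:
    print(f"{PATH} is {percent_used:.0f}% full -- OK")
```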
You’ve now done a lot of work, from the moment you identified the problem all the way through the verification of the fix, and it would be a shame to lose that. So it’s important to document exactly what occurred during that entire process. That way, if this problem occurs again and you don’t happen to be in the organization, somebody else can look up your notes. That’s why we often automate this– put it into a large database and have something that we can search through to find that information later on.
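As a simple illustration of searchable documentation (not any particular ticketing product), the sketch below records a resolution in a small SQLite database that a later technician could query. The field names and the example ticket are hypothetical.

```python
# A minimal sketch of searchable troubleshooting notes in SQLite.
# The field names and the example ticket are hypothetical.
import sqlite3
from datetime import date

db = sqlite3.connect("troubleshooting_log.db")
db.execute("""CREATE TABLE IF NOT EXISTS tickets (
                  logged     TEXT,
                  symptom    TEXT,
                  root_cause TEXT,
                  resolution TEXT)""")

# Document what happened during this troubleshooting process.
db.execute("INSERT INTO tickets VALUES (?, ?, ?, ?)",
           (date.today().isoformat(),
            "Users cannot reach the internal website",
            "Switch uplink fell back to half duplex after a firmware patch",
            "Forced full duplex on the uplink port and verified with users"))
db.commit()

# Later, search those notes when a similar symptom shows up again.
for logged, resolution in db.execute(
        "SELECT logged, resolution FROM tickets WHERE symptom LIKE ?",
        ("%website%",)):
    print(logged, resolution)
db.close()
```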
If you want to see what very long-term documentation can provide, a good example is the paper from Google research, “Failure Trends in a Large Disk Drive Population.” A number of years ago, Google looked at all of their hard drive usage, determined where the failures happened to be, and tried to determine what they saw occurring with those systems prior to the hard drives failing. They were able to document this because they were looking at these problems over a very long period of time. You can read more about it at the URL shown here at the bottom, and you can see how Google was able to take their documentation and use that information to make their network more available.
And there’s our summary, from the very beginning to the very end. Once we see a problem, we identify it and ask questions about it. We come up with some ideas of what the cause might be, and we test those ideas. When a test confirms the cause, we can come up with a plan for the change and then finally implement that plan. After the plan is in place, we need to test it and make sure we really fixed the problem. And finally, we can document everything that we found. By following this simple troubleshooting process, we can solve our problems and get our systems running again as quickly as possible.