As a technology professional, you’ll be expected to find a solution to every problem. In this video, you’ll learn about the troubleshooting process that you’ll follow when resolving these issues.
A big part of information technology is solving problems. And in this video, I’ll take you through a step-by-step process for troubleshooting. In most organizations, troubleshooting a problem involves a series of well-documented steps. These steps are designed to give you a logical path to follow so that you can troubleshoot the problem quickly and easily.
One of the reasons we have this formal troubleshooting flowchart is because of change control. Change control is a way that we can manage any changes that might occur in our environment. This is very commonly seen in organizations that would like to minimize the amount of downtime and mistakes that might occur when a change takes place. When you’re at home, you can make changes to your operating system or changes to your local network without informing anyone else the change is taking place. But outside of your home network, in your organization, there’s a formal process for making these changes.
This process starts with the planning process. You need to decide what change might occur. For example, you might need to plan to perform a software update or an operating system patch on a server. Before making that change, you need to determine what the risk is for that particular change. If you make a change to a server, and there is some type of problem, will that server become unavailable? And if that server is unavailable, how does that affect the overall business?
Change management also means that there is a recovery plan, so you can implement a change, and if that change does not work, you have a way to revert back to the original configuration. It’s common to have a duplicate server in a lab where you can perform the update and then perform tests to make sure that the server is working normally.
Now that you understand the impact of this particular change and you’ve performed testing to make sure that it works, you can document this information and present your results to the change control committee. At that point, they’ll approve or disapprove the change and then decide where on the schedule that particular change will occur. And then on that date, you’re able to make those changes. As you can see, this change control process is much more involved than making changes by yourself on your home network, but it’s designed to make sure that all of the systems and all of the applications are always available and that everybody knows what changes might be occurring at what time.
When you’re working on solving some type of technical issue with your network or your operating system, you’ll want to follow a troubleshooting process to make sure you’re able to solve this problem as quickly and easily as possible. We’ll step through each section of this troubleshooting process so that you can understand a best practice for being able to solve these issues.
The first step in solving a problem is understanding the problem from the beginning. This is perhaps one of the most critical phases of the troubleshooting process because if you aren’t able to identify the issue, you won’t be able to solve the problem. To be able to identify this problem, we need to gather information. We want to gather as many details as possible about this particular issue. You want to be able to duplicate this issue. And one of the ways that we’re going to be able to duplicate it is to know exactly what the problem is to begin with. You want to be able to identify all of the symptoms that are occurring when this problem is happening. You may find that multiple symptoms are occurring, and they might be related to a single problem, or there might be multiple problems that you’re troubleshooting simultaneously.
If you’re working with a user that’s having this problem, make sure you ask them as many questions as you can about the issue. What type of problem is occurring? Do you see any error messages on the screen? What happens after the error message is displayed? Try to gather as much information as possible so that you can understand exactly what the user is seeing from their side.
This is where change control might be able to help you. If there’s some problem that’s occurring today that wasn’t occurring yesterday, it might be useful to know if any changes occurred during that time frame. If you’re identifying multiple problems during this phase, it might be useful to separate them into individual pieces. That way, you’re able to evaluate each one individually. The problems might all be interrelated, or you may find there is a different root cause for each individual problem.
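As a rough illustration of checking what changed between "it worked yesterday" and "it's broken today," here is a minimal Python sketch. The `ChangeRecord` type, the `changes_in_window` helper, and the sample records are all made up for this example; your organization's change control system will have its own format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ChangeRecord:
    # Hypothetical shape of a change control entry (an assumption for this sketch)
    when: datetime
    summary: str


def changes_in_window(records, last_known_good, first_failure):
    """Return changes made between the last known good time and the
    first reported failure, oldest first."""
    hits = [r for r in records if last_known_good <= r.when <= first_failure]
    return sorted(hits, key=lambda r: r.when)


# Made-up sample data for illustration only
records = [
    ChangeRecord(datetime(2024, 3, 1, 22, 0), "Patched file server OS"),
    ChangeRecord(datetime(2024, 3, 4, 23, 30), "Updated switch firmware"),
    ChangeRecord(datetime(2024, 3, 6, 21, 0), "Replaced printer driver"),
]

suspects = changes_in_window(
    records,
    last_known_good=datetime(2024, 3, 4, 9, 0),
    first_failure=datetime(2024, 3, 5, 9, 0),
)
for record in suspects:
    print(record.when.isoformat(), record.summary)
```

Any change that lands inside that window is a good candidate to look at first, although a change outside the window could still turn out to be the cause.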
It’s during this problem identification that you may want to perform some backups. You will eventually be making changes to this environment. So it may be a good idea to have a backup that you can restore to if you run into problems. You may also want to check other help desk tickets or other change records in your organization to see if things may have changed that the user may not know about. There may be changes to the infrastructure or the underlying network that no one in that department knows about, but that may be causing a significant problem with this application. You’ll also want to make sure that you’re gathering as many log files as possible. An operating system has extensive log files available, and some applications will have their own log files that you can use during the troubleshooting process.
Now that we’ve gathered information about the problem, we need to establish a theory about why the problem is occurring. And with most things, the simplest explanation is often the most likely. So use Occam’s razor to make a list of possible reasons that this problem is occurring. Of course, sometimes the explanation for the problem may be relatively complex. So you need to think about all possible reasons that might be causing this issue, even reasons that may not be completely obvious at first glance. Make a list of all of the possible causes for this problem. Start with the easiest issues to resolve at the top, and then put the more complex ones near the bottom.
This means during the testing process, you can start with the least difficult issues to test. You may be able to resolve the problem very early on, or you may find that the issue is more difficult to resolve. So you may end up going further down the list to issues that are more complex to be able to troubleshoot and test. And of course, you should use external sources to be able to gather more information. You can often find details in a third party knowledge base, or use your Google skills to see if someone else may have run across one of these more esoteric issues.
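To make that ordering concrete, here is a small Python sketch that sorts a list of theories so the easiest ones to check come first. The `Theory` type, the 1-to-5 effort scale, and the example theories are assumptions made up for this illustration, not part of any formal process.

```python
from dataclasses import dataclass


@dataclass
class Theory:
    description: str
    effort: int  # hypothetical scale: 1 = quick check, 5 = complex investigation


# Made-up theories, written down in the order they came to mind
theories = [
    Theory("Corrupted driver", 3),
    Theory("Loose or bad network cable", 1),
    Theory("Failing switch port", 2),
]

# Easiest to test at the top, more complex near the bottom
ranked = sorted(theories, key=lambda t: t.effort)

for position, theory in enumerate(ranked, start=1):
    print(position, theory.description)
```

The effort numbers are a judgment call. The point is only that the list gets worked from the top down.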
Now that we’ve made this list of theories, we can perform the testing to see if acting on these theories actually resolves the problem. If your first theory is that it’s a bad cable, you can replace the cable, run your test, and see if the problem was resolved. If that didn’t work, you can move to the next theory on your list and then the next theory down. At some point you may find that calling an expert in this particular area might help you out, whether it’s an expert that’s internal to your organization or it’s an expert that you can call in from a third party to help resolve this particular issue.
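The test-one-theory-then-move-to-the-next loop can be sketched in a few lines of Python. This is only an illustration: the `first_confirmed` helper is hypothetical, and the lambda functions are stand-ins for real tests such as swapping a cable and retrying the connection.

```python
def first_confirmed(theories):
    """Work down an ordered list of (description, test) pairs.

    Each test is a function that returns True if acting on that theory
    resolved the problem. Returns the first confirmed description, or
    None if every theory failed.
    """
    for description, test in theories:
        if test():
            return description
    return None


# Stand-in tests for illustration; real tests would exercise the actual system
theories = [
    ("Bad cable", lambda: False),
    ("Failing switch port", lambda: True),
    ("Corrupted driver", lambda: True),
]

result = first_confirmed(theories)
if result is None:
    print("No theory confirmed; consider calling in an expert")
else:
    print("Confirmed theory:", result)
```

Notice that the loop stops at the first theory that works, so the more complex theories further down the list never need to be tested.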
You’ll go through this process of testing a theory, evaluating it, and then going back to test the next theory, over and over, until you find a resolution. And if you do find a resolution, then you’re now able to begin the process of making a plan to resolve the issue in production. The goal of this plan should be to resolve the issue with the minimum amount of impact. We don’t want to bring the system down for any longer than necessary, and we want to make sure that the user has access to all of their data.
This might mean that we have to resolve the problem when the users are not in the building. If that’s the case, we may want to schedule our work during non-production hours to be able to implement this change. As we’re writing down this plan of action, we may want to consider creating a plan B or even a plan C. That way, if we run into problems with plan A, we can still resolve this issue by going to the next plan on our list.
We’ve now taken our plan to the change control committee. They’ve given us a time frame that we can use to implement the plan. And then we show up in the data center and begin the implementation process. If this plan is relatively complex, we may need to call in additional resources. So don’t be afraid to call someone else, either internal or external to your organization, to come in and help resolve this issue. This might be very important, especially if you have a very small time frame in order to make this change. You want to have as many resources available as possible and as many people as possible who can help if you run into problems.
After performing the implementation, you now need to perform testing to make sure that the changes you put in are the ones that actually resolve the problem. This might be a test that you’re able to do yourself, or it may require bringing your users in to perform the test that they can duplicate on their workstations. Now that the problem’s been resolved, it would be nice if the problem didn’t occur again.
So you might want to evaluate the issue and see if there are any preventative measures that you can implement, so that this issue doesn’t occur again. If other people run into the same problem, it would be nice if they had some documentation they could reference so that they could follow the same path you used to resolve the issue. That’s why it’s so important to document these issues each time a problem is resolved.
You may have a knowledge base or a database of information in your environment that you can use to make sure that others have access to this valuable data. For example, what was the error message that people were receiving? What action did you take? And what was the outcome of the changes that you made? You might be able to put this into a centralized knowledge base. There might be a wiki or some other type of database that you can use to document all of this information.
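As one possible way to capture those three questions in a consistent format, here is a minimal Python sketch that builds a knowledge base entry and serializes it to JSON. The `KnowledgeBaseEntry` type, its field names, and the sample values are assumptions for this example; a real wiki or ticketing system will define its own fields.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class KnowledgeBaseEntry:
    # Hypothetical fields mirroring the three questions above
    error_message: str
    action_taken: str
    outcome: str


# Made-up sample entry for illustration only
entry = KnowledgeBaseEntry(
    error_message="Network path not found",
    action_taken="Replaced the patch cable at the workstation",
    outcome="Connectivity restored; user confirmed access to shared files",
)

# JSON text that could be pasted or uploaded into a centralized knowledge base
record = json.dumps(asdict(entry), indent=2)
print(record)
```

Keeping the same fields on every entry makes it easier for someone else to search for the error message they are seeing and find the action that fixed it.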
Let’s step through the troubleshooting process one last time. We’ve run into a problem, so the first thing we’re going to do is gather as much information as possible to identify what the problem actually is. This might be error messages, information from our users, or log files. Then we can establish a theory of what we think the problem might be, and then we can perform some tests to see if our theory is going to resolve the problem. If that theory doesn’t fix the problem, we can go to the next theory on our list and perform the test for that theory, until we find one that resolves the issue.
Now that we think we have a fix, we can document the process for applying that fix in our production environment. Our change control team will give us the time and the date, and we’ll be able to implement that plan into production, and then verify that the system is indeed working with our fix in place. With this fix in place, we can then document everything that we learned during this process. If this problem occurs again, we’ll have a database or a knowledge base of notes that we can use to solve the problem and make sure that everything is working properly.