Windows – Performing root-cause analysis

debugging troubleshooting windows

I want to learn more about how to perform root-cause analysis. More often than not, our department tells the user to try rebooting (their Windows XP system), which actually "fixes" a good number of problems. When I am in a hurry (and sometimes being paid hourly contributes to this), I might look for a workaround to get the problem solved quickly instead of actually performing root-cause analysis.

Most of the time I am looking in log files or the Event Viewer for this information. Sometimes I will use the Sysinternals tools or occasionally run a packet sniffer. I probably don't use the Sysinternals programs as much as I should. Some specific insight into which of these tools you use, when, and why would also be helpful.

I know this is a wide-open question, but could you please briefly explain the methodology, tools, etc. that you use? It looks like a lot of admins on SF use a more in-depth process, which I would like to learn more about. If this helps narrow down the question, I am most interested in tools, tips, tricks, etc. relevant to Windows servers and clients within an AD environment.

Best Answer

Figuring out the root cause of a problem depends on the problem. Your initial instinct to look at log files, Sysinternals tools, and packet sniffers is generally correct.
I would add running the MS Malicious Software Removal Tool and a good AV program on Windows systems (and ensuring that they don't have something like CyberDefender or other fake-AV trojan malware installed).

The folks at Stack Exchange are proponents of the "5 Whys" method (http://en.wikipedia.org/wiki/5_Whys, also this nice short PDF that shows it in action). It is a pretty valuable tool for doing root cause analysis.


Beyond that I'll paint two broad categories and some of the questions I usually ask/things I check:

Mysterious behavior not related to the network
e.g. "Word keeps crashing on me"

Basic questions to ask:

  1. What Changed?
    (Don't take "nothing" for an answer -- it is the first lie. New software, patches, etc. all count.)
  2. What were you doing when you had the problem?
    (Try to extract as much detail as possible here -- in my example above: "I hit the hotkey for insert initials and the program crashed.")
  3. Did it ever work before?
    (If so, start looking at stuff from (1) above)
  4. Can you reproduce the problem on your system?
    (If so, that's a good sign: a tech support call to the vendor may help. If not, you'll need to look at the user's system for the rest of these questions.)
  5. What's different about the user's environment compared to yours?
  6. Is the user's hardware suspect? (Run a memory test, look for SMART errors from the hard drive, etc.)
  7. If you've gotten this far (hardware checks out, software checks out, no viruses, no malware) go visit the user for a day. Observe their work habits.
    My company once had a mysterious system lock-up that related to clicking the mouse at a specific frequency. (We still don't know why, but we had to watch a user doing it and practice for a day in order to be able to reproduce it reliably.)
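
One way to get past "nothing changed" is to compare what is installed now against an earlier record. The following is only a minimal sketch in Python, assuming you already have two snapshots as name-to-version mappings (for example, exported from whatever inventory tool you use). The function name `diff_snapshots` and all of the sample data are hypothetical.

```python
def diff_snapshots(before, after):
    """Compare two {name: version} snapshots and report what changed."""
    # Items present only in the newer snapshot.
    added = {n: after[n] for n in after.keys() - before.keys()}
    # Items present only in the older snapshot.
    removed = {n: before[n] for n in before.keys() - after.keys()}
    # Items present in both, but with a different version.
    changed = {
        n: (before[n], after[n])
        for n in before.keys() & after.keys()
        if before[n] != after[n]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots; the names and versions are made up for illustration.
last_week = {"Office": "11.0.1", "Acrobat Reader": "9.1", "InitialsAddin": "1.0"}
today = {"Office": "11.0.2", "Acrobat Reader": "9.1", "ToolbarHelper": "3.2"}

report = diff_snapshots(last_week, today)
for kind, items in report.items():
    print(kind, items)
```

Anything that shows up under "added", "removed", or "changed" is a candidate answer to question 1, even when the user insists nothing is different.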

Problems related to the network

A lot of this is similar, but with some more specific guidance.

  1. What Changed?
    (Yeah, you always start there)
  2. What is broken?
    • Can you reach web pages? Is it just one that's down? If so, is it down for everyone or just you?
    • Can you ping stuff on the internet by name?
      How about by IP? How far does the traceroute get?
  3. When is it broken?
    • Always the same time of the day?
    • For a brief period every N days?
    • Randomly (is it REALLY random? Plot it on a calendar...)
  4. Is there something odd about the remote site?
    • Look at DNS - If it's round-robin'd there could be remote-side breakage
    • Are we talking about the other end of a VPN? What's up with the VPN (logs!)?
  5. Is there something odd about the local site?
    • Check your local firewall
    • Check any "filtering software"
  6. Check with your ISP to see if there are any known issues
  7. Check sites like http://www.internetpulse.net/ for known network-wide issues
  8. Check out the user's machine
    (TCP settings, etc. - Usually not the problem, but sometimes.)
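
The "by name, then by IP" check from step 2 can be roughed out in a few lines of Python. This is a sketch, not a replacement for ping and traceroute: it swaps ICMP ping for a plain TCP connection attempt (ICMP usually needs raw sockets and elevated privileges), and the helper names `can_resolve`, `can_connect`, and `triage` are my own, not part of any standard tool.

```python
import socket


def can_resolve(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def can_connect(ip, port=80, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def triage(name_ok, ip_ok):
    """Map the two check results to a likely area to investigate next."""
    if name_ok and ip_ok:
        return "names resolve and IPs are reachable: look at the specific site or application"
    if ip_ok:
        return "reachable by IP but not by name: suspect DNS"
    if name_ok:
        return "names resolve but the IP is unreachable: suspect routing, firewall, or the remote side"
    return "neither works: suspect the local network, firewall, or ISP"


# Demonstrate the decision logic without touching the network.
print(triage(name_ok=False, ip_ok=True))
```

On a live machine you might call `triage(can_resolve("example.com"), can_connect("192.0.2.10"))`, substituting a hostname and an IP address you know should be reachable; both values here are placeholders. A failed TCP connection on port 80 only tells you that one port is unreachable, so treat the result as a hint about where to look next, not a diagnosis.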