How to determine the root cause of a Windows server hanging/freezing

windows-server-2008-r2

We set up a new server here a few weeks ago that I am informally responsible for managing.

Almost everything works perfectly except for one thing: Every so often it hangs without warning.

Some facts about this hang:

  • It is not a single application or service; the entire system is non-responsive.
  • Nothing is displayed (monitor acts as though there's no VGA signal).
  • The power LED is on and the fans are running.
  • Pressing the power button does nothing (normally it would shut the machine down).
  • Pings generally time out; once it did respond, another time I got "destination host unreachable".
  • Event logs show nothing (literally nothing at all) from before the hang until the hard reboot.
  • There are no performance problems, strange errors, or other obvious signs of impending doom leading up to the eventual hang.
  • The machine is generally not heavily loaded (it's for development, not production), and the hangs appear to be occurring at non-peak times of day (between midnight and 6 AM).

Some additional facts about the machine/environment:

  • Windows Server 2008 R2
  • Running SQL Server 2008 and IIS (not much else)
  • All drivers up to date, patches installed, etc.
  • No vendor-supplied diagnostics (not "top tier").
  • The machine is completely new, not merely reformatted or repurposed. There have been no recent changes, although the machine is less than a month old to begin with.

I don't expect any easy answers here. What I'd like to know is how I can methodically determine the root cause of this problem, be it a misbehaving service, defective hardware, or something else.

Is there any kind of logging I can set up that will help me get to the bottom of this? Any hardware diagnostics or remote monitoring? Anything else I can do to help me discover what's actually happening, or at least be able to eliminate what isn't wrong?
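To illustrate the kind of logging I have in mind: since the event logs are empty around the hang, even a crude heartbeat file would at least pin down when the machine stops. Below is a minimal sketch in Python, assuming a Python interpreter is available on the server. The function names and the log path are hypothetical, and it only records when the hang occurred, not why.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path, now=None):
    """Append one UTC timestamp line to `path` and force it to disk."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    with open(path, "a", encoding="utf-8") as f:
        f.write(stamp + "\n")
        f.flush()
        # Ask the OS to flush its buffers so the line survives a hard reset.
        os.fsync(f.fileno())
    return stamp


def last_heartbeat(path):
    """Return the final timestamp line in the file, or None if missing/empty."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return None
    return lines[-1] if lines else None


def run(path, interval_seconds=10):
    """Loop forever, writing a heartbeat every `interval_seconds` (assumed value)."""
    while True:
        write_heartbeat(path)
        time.sleep(interval_seconds)
```

Calling `run(r"C:\heartbeat.log")` (a hypothetical path) at startup and reading `last_heartbeat` after the hard reboot would give the hang time to within the chosen interval, which could then be compared against scheduled tasks, backups, or the ping results.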

Just to reiterate, I really don't want to start speculating about possible causes and take a trial-and-error approach, because it's going to be at least several days at a time before I would have conclusive results. I'm looking for solutions to reliably trace the problem to its source.