Troubleshooting the dreaded 0x9C BSOD

bsod, server-crashes, stoperror, windows-server-2003

We have a Dell PowerEdge 2950 running Windows Server 2003 R2, Enterprise x64 with Service Pack 2 installed.

Recently, that server has been throwing multiple STOP errors. Fortunately it is only in place as a failover machine, so it is not currently affecting our production environment. The error that shows up in the server log is this:

Event Type: Error
Event Source:   System Error
Event Category: (102)
Event ID:   1003
Description:
Error code 000000000000009c, parameter1 0000000000000004, 
parameter2 fffffadf90881240, parameter3 00000000f2000000, 
parameter4 0000000000060151.

So far the best I've been able to track down is that the 9C error is some sort of generic hardware problem. The other parameters have been no use in narrowing this one down.

There have been no hardware changes since the machine was brought into service last year. It has an identical twin (the primary that this one acts as a failover for) that is not experiencing the behavior. The last software change was on 4/16/2009, when several security updates were applied. The blue screens started happening on 5/9/2009.

Are there any diagnostics that may help with this problem?

Best Answer

See Kazna3's answer at http://www.d-a-l.com/archive/index.php/t-49205.html. He/she writes:

But first, the BSOD is pretty old. The 0x9C BUGCHECK is hardware related, well known. The rest of it concerns the processor, it's a processor fault or just the processor driver. :(

Have a look here for the explanation: 0x9C: MACHINE_CHECK_EXCEPTION (http://msdn2.microsoft.com/en-us/library/ms795775.aspx)

Microsoft used to advise this when we got it with the P4s:

Step 1) Update your BIOS (hardware patches called microcode updates ride here, if your processor or AMLI has an errata, it would be fixed here).

Step 2) Call hardware vendor immediately as this is a strict hardware error.

Step 3) Replace hardware, starting with CPU.

In other words, your hardware is likely borked, possibly from a brown-out or high heat. Just because a component is solid-state doesn't mean it can't fail. For example, RAM fails all the time - there's a reason it ships in static-resistant bags.
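
The parameters in the event log entry are not entirely useless, either. Here is a rough Python sketch of how they could be decoded. It assumes the layout Microsoft documents for 0x9C on processors with the MCA feature (parameter1 is the bank number, parameter3 and parameter4 are the high and low halves of that bank's MCi_STATUS register) and the flag bit positions from Intel's architectural definition. The function name and dictionary keys are my own, purely for illustration.

```python
# Sketch: decode the bug check 0x9C parameters from the event log entry.
# Assumed layout (processors with the MCA feature):
#   parameter1 = MCA bank number
#   parameter2 = address of the MCA_EXCEPTION structure (not decoded here)
#   parameter3 = high 32 bits of the bank's MCi_STATUS register
#   parameter4 = low 32 bits of the bank's MCi_STATUS register

# Flag bits in the upper half of MCi_STATUS (Intel architectural layout).
STATUS_FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_machine_check(p1, p3, p4):
    """Hypothetical helper: rebuild MCi_STATUS and split out its fields."""
    status = (p3 << 32) | (p4 & 0xFFFFFFFF)
    return {
        "bank": p1,
        "status": status,
        "flags": [name for bit, name in STATUS_FLAGS.items()
                  if (status >> bit) & 1],
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }


# Values taken from the event log entry in the question.
result = decode_machine_check(0x4, 0xF2000000, 0x00060151)

print(f"bank {result['bank']}, MCi_STATUS = {result['status']:#018x}")
for flag in result["flags"]:
    print("  set:", flag)
print(f"model-specific code = {result['model_specific_code']:#06x}")
print(f"MCA error code      = {result['mca_error_code']:#06x}")
```

For the values in the question, VAL, OVER, UC, EN and PCC are set, MISCV and ADDRV are clear, and the MCA error code is 0x0151. If I'm reading Intel's compound error code tables correctly, that pattern is a cache hierarchy error (instruction fetch, level 1), which would fit the "start with the CPU" advice above, but check the Software Developer's Manual for your exact processor before relying on that.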