Diagnostics for a server that keeps turning off

supermicro

I have a 1U Supermicro box that's a few years old and out of warranty. Recently it has begun randomly shutting down. It will stay up for anywhere from a few hours to a week and then stop responding. The IPMI console shows it as powered on, but it's completely unresponsive.

I'd very much like to fix this machine, as the owners are very budget constrained. It currently runs CentOS 7.

What I've looked for:

  • IPMI logs – empty
  • System logs – nothing relevant
  • SAR – nothing interesting
  • Hardware sensors – fans are on, CPU temp is nominal

What I've tried:

  • Supermicro diagnostics – the (UEFI) image won't boot properly on this system
  • memtest+ – ran for 24 hours with no incident

Given that it has redundant power supplies, I'm thinking power isn't the issue. That leaves the CPU and the mainboard.

  • What other tests can I run?
  • What other log sources could I look into?
  • What else might be failing?

Edit:

I started the machine up and let it run until it quit (after roughly 12 hours). The IPMI window shows that it's stuck on the boot screen, of all things.

It had been booted and running before this. That makes me think it's a mainboard issue. There aren't any USB devices plugged in, and it's well and truly wedged.

Best Answer

I wouldn't completely rule out the PSU. If they're redundant, you could try running with only one, then the other.
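When you swap PSUs it helps to know exactly when the box died under each configuration, since a hard hang usually leaves nothing in the logs. Here is a minimal sketch of a heartbeat logger you could leave running. The file path, the interval, and the function names are my own assumptions, not anything specific to Supermicro or CentOS.

```python
import os
import time
from datetime import datetime, timezone


def write_heartbeat(path, now=None):
    """Append one UTC timestamp and force it to disk so it survives a hard hang."""
    now = now or datetime.now(timezone.utc)
    with open(path, "a") as f:
        f.write(now.isoformat() + "\n")
        f.flush()
        os.fsync(f.fileno())  # push the line past the page cache


def last_heartbeat(path):
    """Return the last recorded timestamp, or None if the file is missing or empty."""
    try:
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return None
    return datetime.fromisoformat(lines[-1]) if lines else None


def run(path="/var/tmp/heartbeat.log", interval=60):
    """Write a heartbeat every `interval` seconds until the machine stops (hypothetical defaults)."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

Call `run()` from a screen session or a small service, note which PSU is plugged in, and after the next hang use `last_heartbeat()` to see how long that configuration survived. A few runs per PSU should show whether the failures track one supply.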

Can you get replacement CPU(s)? Used Xeons are pretty cheap, and you can still sell them afterwards. If it's a multi CPU system, try removing all but one.
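Before pulling CPUs, it may be worth one more pass over the kernel log for machine-check reports, which sometimes point at a specific socket. This is only a rough sketch: the pattern is my guess at the usual wording ("mce", "Machine Check", "Hardware Error"), and a hard hang may leave nothing behind at all.

```python
import re

# Assumed wording of kernel hardware-fault messages; adjust if your logs differ.
PATTERN = re.compile(r"machine check|hardware error|\bmce\b", re.IGNORECASE)


def hardware_error_lines(lines):
    """Return the log lines that look like machine-check or hardware-error reports."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]
```

On CentOS 7 you could feed it `/var/log/messages`, e.g. `hardware_error_lines(open("/var/log/messages"))`. If anything shows up with a CPU or bank number, start with that socket when you test with a single CPU.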

Does the system have a separate, replaceable VRM for the CPU?

It could well be the mainboard, but that probably means the machine is dead.
