I have a dedicated server that becomes unresponsive after several days of uptime.
Sometimes it takes 1 day and sometimes 5, but it always happens. When it does, I can neither reach the server via SSH nor log in to the Supermicro control panel.
I have to power-cycle the server from my provider's control panel to make it accessible again.
The server isn't running anything heavy, just a LAMP setup.
How can I diagnose this, to see what's wrong and to fix the issues?
The only prominent thing I found is in the messages file:
Aug 16 18:01:50 server1 kernel: sbridge: HANDLING MCE MEMORY ERROR
Aug 16 18:01:50 server1 kernel: CPU 0: Machine Check Exception: 0 Bank 7: 8c00004000010093
Aug 16 18:01:50 server1 kernel: TSC 0 ADDR 2804ab80 MISC 214042c286 PROCESSOR 0:306e4 TIME 1439766110 SOCKET 0 APIC 0
Aug 16 18:01:50 server1 kernel: EDAC MC0: CE row 6, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x2804ab80 => socket=0, Channel=3(mask=8), rank=2
Best Answer
The machine reports a RAM error and even tells you which module is affected: the EDAC line labels it CPU_SrcID#0_Channel#3_DIMM#0 (socket 0, channel 3). Recommendation: replace that module and see if the problem goes away.
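
To check whether the same module keeps showing up before you ask for a replacement, you can count the EDAC entries per DIMM label. The following is only a rough sketch in Python: the function name tally_edac_errors and the regular expression are my own, and the pattern assumes the line format shown in the question (other kernel versions may word the EDAC message differently).

```python
import re
from collections import Counter

# Assumption: EDAC lines look like the one quoted in the question, e.g.
#   kernel: EDAC MC0: CE row 6, channel 0, label "CPU_SrcID#0_...": ...
# CE = corrected error, UE = uncorrected error.
EDAC_RE = re.compile(r'EDAC MC\d+: (?P<kind>CE|UE) .*?label "(?P<label>[^"]+)"')


def tally_edac_errors(lines):
    """Count EDAC error lines per (DIMM label, error kind)."""
    counts = Counter()
    for line in lines:
        match = EDAC_RE.search(line)
        if match:
            counts[(match.group("label"), match.group("kind"))] += 1
    return counts


# The EDAC line from the question, used here as sample input.
SAMPLE = (
    'Aug 16 18:01:50 server1 kernel: EDAC MC0: CE row 6, channel 0, '
    'label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): '
    'memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), '
    'addr = 0x2804ab80 => socket=0, Channel=3(mask=8), rank=2'
)

for (label, kind), n in tally_edac_errors([SAMPLE]).most_common():
    print(label, kind, n)
```

To run it against the real log, pass an open file instead of the sample list, for example tally_edac_errors(open(path)), where path is wherever your distribution keeps the messages file (often /var/log/messages, but that is an assumption about your setup). If one label dominates the counts, that is the module to swap first.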