CentOS – Server becomes unresponsive after several days

centos dedicated-server supermicro

I have a dedicated server that becomes unresponsive after several days of uptime.
Sometimes it takes 1 day and sometimes 5, but it always happens, and when it does I can neither reach the server via SSH nor log into the Supermicro control panel.

I have to power off and on the server from my provider's control panel to make the server accessible again.

The server isn't running anything heavy, just a LAMP setup.

How can I diagnose this, to see what's wrong and to fix the issues?

The only prominent thing I found is in the messages file:

Aug 16 18:01:50 server1 kernel: sbridge: HANDLING MCE MEMORY ERROR
Aug 16 18:01:50 server1 kernel: CPU 0: Machine Check Exception: 0 Bank 7: 8c00004000010093
Aug 16 18:01:50 server1 kernel: TSC 0 ADDR 2804ab80 MISC 214042c286 PROCESSOR 0:306e4 TIME 1439766110 SOCKET 0 APIC 0
Aug 16 18:01:50 server1 kernel: EDAC MC0: CE row 6, channel 0, label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), addr = 0x2804ab80 => socket=0, Channel=3(mask=8), rank=2

Best Answer

The kernel log reports a memory (RAM) error, and the EDAC line even tells you which module is affected: the one labelled "CPU_SrcID#0_Channel#3_DIMM#0" (socket 0, channel 3). Recommendation: replace that module and see if the problem goes away.
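
If you want to check whether the same module keeps showing up before swapping it, a small script can pull the DIMM label out of the EDAC lines and count how often each one appears. This is only a rough sketch: the regular expression is written against the single EDAC line quoted in the question, and the function names (`parse_edac_line`, `count_errors_by_dimm`) are made up for this example. EDAC message formats differ between kernel versions and drivers, so the pattern may need adjusting for your logs.

```python
import re
from collections import Counter

# Sample line copied from the question; in practice you would read
# the lines from your own messages file instead.
EDAC_LINE = (
    'Aug 16 18:01:50 server1 kernel: EDAC MC0: CE row 6, channel 0, '
    'label "CPU_SrcID#0_Channel#3_DIMM#0": 1 Unknown error(s): '
    'memory read on FATAL area : cpu=0 Err=0001:0093 (ch=3), '
    'addr = 0x2804ab80 => socket=0, Channel=3(mask=8), rank=2'
)

# Assumed pattern, modelled only on the line above.
# CE = corrected error, UE = uncorrected error (EDAC terminology).
EDAC_PATTERN = re.compile(
    r'EDAC (?P<controller>MC\d+): (?P<severity>CE|UE) .*?'
    r'label "(?P<label>[^"]+)".*?'
    r'addr = (?P<addr>0x[0-9a-fA-F]+)'
)


def parse_edac_line(line):
    """Return controller, severity, label and address, or None if no match."""
    match = EDAC_PATTERN.search(line)
    return match.groupdict() if match else None


def count_errors_by_dimm(lines):
    """Count matching EDAC lines per DIMM label, skipping everything else."""
    counts = Counter()
    for line in lines:
        parsed = parse_edac_line(line)
        if parsed is not None:
            counts[parsed["label"]] += 1
    return counts


if __name__ == "__main__":
    print(parse_edac_line(EDAC_LINE))
    print(count_errors_by_dimm([EDAC_LINE, "unrelated log line", EDAC_LINE]))
```

If one label dominates the counts, that is the module to replace first.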
