Linux – Server Randomly Freezes and Recovers Only with a Cold Boot

centos, crash, freeze, linux, server-crashes

I'm facing an extremely weird issue with one server: it randomly freezes/hangs with no output on the console, does not respond to keyboard shortcuts, and requires a cold boot. After the cold boot, there are no errors on the boot screen at all.

It's not freezing under heavy load: CPU usage is around 9-20% when it crashes, with a load average of around 2-5 (12-core CPU) and 128 GB of RAM.

We checked the logs, but nothing shows up like a kernel panic or anything else related to the issue itself.

After every freeze and cold boot, when we check the log we do see the normal OOM reaper killing PHP processes (users reaching their limits). It's nothing too abusive, but the OOM kills are always there.
Sometimes the last log entries before the freeze carry the time of the crash, and sometimes the entries from the time of the crash are followed by a few lines with an older date, and then it freezes.

Nothing in the logs points to a software problem or to heavy load, just normal operation. This machine is an upgrade from an old one that was stable for years.
The freezes are random: they can happen after a week of uptime, or after two days, or after three weeks, etc.

We also tried to capture a vmcore dump of a freeze, but nothing gets caught there either.

It just freezes with no screen output. The server is still powered on, but it is not pingable and can't be reached over SSH, and as I said, the KVM shows no output on the screen at all.

Could it be related to faulty hardware? My suspicion is faulty RAM.

I'm extremely lost with this issue.
Thanks

Best Answer

  1. Make sure the temperatures are good (CPU, RAM, chipset, disks). I assume you are a Linux user because of the OOM messages, so install lm-sensors and check the temps with the sensors command.
  2. It's your RAM. Run memtest86, and be aware that a full test on 128 GB can take a week.
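
Since the box hangs hard and the last log lines do not always make it to disk, a one-off look at sensors may miss an overheating event. A rough sketch of how the temperature check in step 1 could be left running in the background: it reads the kernel's hwmon sysfs files (the same data lm-sensors reports) and appends a timestamped line to a log, calling fsync after each write so the last reading before a freeze is more likely to survive. The log path `/var/log/temps.log`, the 10-second interval, and the function names are my own assumptions, not part of any standard tool.

```python
import glob
import os
import time


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/sensor": degrees Celsius} read from the hwmon sysfs tree."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        # Chip name, e.g. "coretemp"; fall back to the directory name.
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)
        # Optional human-readable label; fall back to "temp1", "temp2", ...
        base = path[: -len("_input")]
        try:
            with open(base + "_label") as f:
                sensor = f.read().strip()
        except OSError:
            sensor = os.path.basename(base)
        # Values are exposed in millidegrees Celsius.
        try:
            with open(path) as f:
                temps[f"{chip}/{sensor}"] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # some sensors refuse reads; skip them
    return temps


def format_line(temps, now=None):
    """Build one log line: timestamp followed by sorted sensor readings."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    readings = " ".join(f"{k}={v:.1f}C" for k, v in sorted(temps.items()))
    return f"{stamp} {readings}"


def log_forever(log_path="/var/log/temps.log", interval=10):
    """Append a reading every `interval` seconds, forcing each line to disk."""
    with open(log_path, "a") as log:
        while True:
            log.write(format_line(read_hwmon_temps()) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before a possible hang
            time.sleep(interval)
```

Calling `log_forever()` as root (for example from a small systemd unit or a screen session) and reading the tail of the log after the next cold boot should show whether temperatures were climbing right before the freeze. If they look normal every time, that points back toward step 2.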