Solaris 10 server seems to be shutting down by itself

dell solaris solaris-10

Every few weeks one of our Solaris 10 servers becomes unresponsive. I can telnet to port 22 and get the SSH banner, but I cannot actually establish an SSH connection. It's a Dell R610, so I log in via the DRAC console. There I can press Enter and get a new line, but whenever I try to run a command such as 'prstat' the console hangs and neither Control-C nor anything else gets through. I am also unable to send it a CTRL-ALT-DEL to reboot gracefully, so I end up doing a remote hard power cycle.

Nothing strange appears in the logs. We have set up cron jobs that append the output of prstat, iostat, vmstat, sar, etc. to a file every minute to try to see what is causing this, but all we see is that the machine is fine and then everything stops.
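For reference, the per-minute capture could look roughly like the sketch below. It is only an illustration, not our actual job: it assumes a Python 3 interpreter is installed on the host (stock Solaris 10 does not ship one), and the command list, the log path and the `capture` name are made up. The one deliberate choice is to flush and fsync after every write, so whatever was collected last has the best chance of being on disk when the machine stops.

```python
import os
import subprocess
import time

# Hypothetical command set and log path; adjust for the host in question.
COMMANDS = [
    ["prstat", "-n", "20", "1", "1"],
    ["vmstat", "1", "2"],
    ["iostat", "-xn", "1", "2"],
]
LOG_PATH = "/var/tmp/hang-capture.log"


def capture(commands, log_path, timeout=20):
    """Append timestamped output of each command to log_path, syncing to disk."""
    with open(log_path, "a") as log:
        for cmd in commands:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write("=== %s %s\n" % (stamp, " ".join(cmd)))
            # Write the header out first, so a command that never returns
            # still leaves a trace of which command was started and when.
            log.flush()
            os.fsync(log.fileno())
            try:
                result = subprocess.run(
                    cmd,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    timeout=timeout,
                    universal_newlines=True,
                )
                log.write(result.stdout)
            except (OSError, subprocess.TimeoutExpired) as exc:
                # Missing binary or a command that exceeded the timeout.
                log.write("FAILED: %r\n" % (exc,))
            log.flush()
            os.fsync(log.fileno())
```

A cron entry would run a script that ends with `capture(COMMANDS, LOG_PATH)`. The timeout is best effort only: if a command is stuck inside the kernel, the script may hang along with it, but the header line written beforehand would still show which command it was.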

We are also graphing metrics in Cacti and don't see anything there either. As I said, everything is normal and then the data just stops.

The problem happened again last night, and we discovered in the 'last' output that the machine seems to start shutting down a couple of hours before it becomes unresponsive (no one is shutting it down). Here is the output:

reboot system boot Tue Nov 23 17:24 <– here is where I rebooted it.
reboot system down Tue Nov 23 15:01
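To put a number on the gap between the "system down" record and the forced reboot, the two lines above can be parsed with a small sketch like this one. The function names are made up for illustration, and the year is an assumption because 'last' does not print one; it only matters if a gap spans a year boundary.

```python
import re
from datetime import datetime

# Assumption: 'last' prints no year. 2010 is a year in which Nov 23 is a Tuesday.
ASSUMED_YEAR = 2010

# Picks the event type and timestamp out of lines such as
# "reboot system down Tue Nov 23 15:01"; the weekday is skipped.
RECORD = re.compile(
    r"system (boot|down)\s+\w{3}\s+(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2})"
)


def parse_records(lines, year=ASSUMED_YEAR):
    """Return (event, datetime) tuples sorted oldest first."""
    records = []
    for line in lines:
        match = RECORD.search(line)
        if match:
            event, month, day, hhmm = match.groups()
            when = datetime.strptime(
                "%d %s %s %s" % (year, month, day, hhmm), "%Y %b %d %H:%M"
            )
            records.append((event, when))
    return sorted(records, key=lambda record: record[1])


def down_to_boot_gaps(records):
    """Pair each 'down' record with the next 'boot' and return the gaps."""
    gaps = []
    pending_down = None
    for event, when in records:
        if event == "down":
            pending_down = when
        elif event == "boot" and pending_down is not None:
            gaps.append((pending_down, when, when - pending_down))
            pending_down = None
    return gaps


sample = [
    "reboot system boot Tue Nov 23 17:24",
    "reboot system down Tue Nov 23 15:01",
]
for down, boot, gap in down_to_boot_gaps(parse_records(sample)):
    print("down %s, boot %s, gap %s" % (down, boot, gap))
```

For the two records above this gives a gap of 2 hours 23 minutes between the "system down" entry and the boot that followed the power cycle.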

There are no environmental or chassis alarms in the DRAC.

I've checked for any cron jobs, etc. that could be shutting down the server, and don't see anything. I want to enable auditd, but that requires a reboot and this is a major production system.
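The cron check can be made repeatable with something like the sketch below. It is a rough illustration under assumptions: Python 3 is available, the crontabs live in the standard Solaris directory /var/spool/cron/crontabs, and the pattern list and function name are made up. It only catches commands named directly in a crontab, not a shutdown buried inside a script that cron calls.

```python
import os
import re

# Commands that can stop or restart a host. The list is illustrative, not
# exhaustive, and can match harmless entries that merely contain these words.
SUSPECT = re.compile(r"\b(shutdown|halt|poweroff|reboot|uadmin|init\s+[0156sS])\b")
CRON_DIR = "/var/spool/cron/crontabs"  # standard Solaris crontab location


def find_suspect_entries(cron_dir):
    """Return (crontab name, line number, text) for entries matching SUSPECT."""
    hits = []
    for name in sorted(os.listdir(cron_dir)):
        path = os.path.join(cron_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as handle:
            for lineno, line in enumerate(handle, 1):
                text = line.strip()
                # Skip blank lines and comments.
                if not text or text.startswith("#"):
                    continue
                if SUSPECT.search(text):
                    hits.append((name, lineno, text))
    return hits
```

Run as root, `find_suspect_entries(CRON_DIR)` returns an empty list when no crontab names one of these commands directly.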

Can anyone offer any advice?

Dell R610
Solaris 10 5/09 s10x_u7wos_08 X86

Thanks,

Shane

Best Answer

I discovered that if I go into BIOS -> CPU Settings and disable the C-state setting ("C-Settings"), the servers no longer crash. The servers with this change have been up for over a month now, while the other servers, which did not have the setting disabled, still crashed.