Linux – How to Minimize Linux Server Reboots

kernel, linux, patch-management

Last week there were a fair few comments on a Slashdot article about whether Unix (or Linux) machines ever need to be rebooted. More than a few of the commenters mentioned having machines with uptimes of several years.

As I understand it, Linux boxes need to be rebooted fairly often to apply kernel patches, especially security-related ones (such as the fix for the ac1db1tch3z exploit). Running 'uname -r' after a 'yum update kernel' still reports the old version, which seems to show that the new kernel isn't loaded until a reboot.

My question is, how are these boxes achieving multiple-year uptimes given this? A few possible explanations I've thought of:

  1. The machines aren't production and/or exposed to users so the security patches aren't as much of a concern.
  2. All of the posters are using live patching services such as Ksplice.
  3. The kernel security patches can be applied by reloading modules rather than the entire kernel.
  4. uname -r is reflecting incorrect information after a kernel patch, and the updated kernel is loaded after all.

Are any of these explanations reasonable, or is there something I'm missing in my understanding? Is there another way to minimize the two dozen or so reboots that were necessary over the last two years?
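The 'uname -r' observation above can be checked with a short script. What follows is a minimal sketch, not a definitive test: it assumes every installed kernel has a directory under /lib/modules (true on most distributions, but not guaranteed), and the function names newest_kernel and reboot_pending are made up for illustration. It also relies on GNU 'sort -V', which only approximates how a distribution orders its kernel release strings.

```shell
#!/bin/sh
# Sketch: report whether a newer kernel is installed than the one running.
# Assumption: each installed kernel has a directory under /lib/modules.

# Print the highest of the release strings passed as arguments.
# sort -V (GNU coreutils) only approximates distro version ordering.
newest_kernel() {
    printf '%s\n' "$@" | sort -V | tail -n 1
}

# reboot_pending RUNNING [INSTALLED...]
# Exit status 0 if some installed release sorts after RUNNING.
reboot_pending() {
    running_release=$1
    [ "$(newest_kernel "$@")" != "$running_release" ]
}

running=$(uname -r)
# Word splitting of the ls output is intentional: one argument per release.
if reboot_pending "$running" $(ls /lib/modules 2>/dev/null); then
    echo "running $running, but a newer kernel is installed: reboot pending"
else
    echo "running $running, which is the newest installed kernel"
fi
```

If the script reports a pending reboot right after a kernel update, then 'uname -r' is telling the truth: the new kernel is on disk, but the old one is still the one in memory. On RPM-based systems the package manager's own listing ('rpm -q kernel --last', ordered by install time) is probably a more reliable source than the directory listing used here.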

Best Answer

I think the only time one needs to reboot a Linux machine is to replace the kernel. I have several machines that have been running for more than 2 years, but I maintain them on the "if it ain't broke, don't fix it" principle, and that is how I achieve the uptime. Of course, if your servers are exposed to external threats, you will need to apply security fixes periodically, and some of them will require a new kernel. I'm not aware of any way to do that reliably without rebooting the machine. There may be some tricks, but there is a good chance that you will compromise stability in the process, and you will need to take the machine into single-user mode. You would technically preserve the uptime, but the machine would not be available to end users during that time, so what's the point?

If uptime is really critical for you, you may be interested in some form of HA/clustering solution, where you can reboot one node of a cluster without affecting the availability of the entire system. Otherwise, just reboot.
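The one-node-at-a-time idea can be sketched as a rolling reboot. This is only an outline under stated assumptions: drain_node, reboot_node, wait_until_healthy and enable_node are hypothetical placeholders that just print what they would do, and node1/node2 are made-up host names. The real commands depend entirely on your load balancer and cluster manager.

```shell
#!/bin/sh
# Sketch of a rolling reboot: take one node out of service at a time.
# The four helpers below are hypothetical placeholders; replace their
# bodies with the real commands for your load balancer and cluster.

drain_node()         { echo "drain $1"; }    # stop sending traffic to the node
reboot_node()        { echo "reboot $1"; }   # e.g. ssh "$1" reboot
wait_until_healthy() { echo "wait $1"; }     # poll until the node answers again
enable_node()        { echo "enable $1"; }   # put the node back in rotation

rolling_reboot() {
    for node in "$@"; do
        drain_node "$node" || return 1
        reboot_node "$node" || return 1
        # Stop here if a node fails to come back, so that the
        # remaining nodes stay in service.
        wait_until_healthy "$node" || return 1
        enable_node "$node" || return 1
    done
}

rolling_reboot node1 node2
```

The design point is that the loop aborts as soon as any step fails, so a node that does not come back with the new kernel never causes the next node to be taken down as well. Each individual machine still loses its uptime; it is the service as a whole that stays available.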