What *exactly* gets screwed when I kill -9 or pull the power?

corruption, electrical-power, kill

Set-Up

I've been a programmer for quite some time now but I'm still a bit fuzzy on deep, internal stuff.

Now. I am well aware that it's not a good idea to either:

  1. kill -9 a process (bad)
  2. spontaneously pull the power plug on a running computer or server (worse)

However, sometimes you just plain have to. Sometimes a process just won't respond no matter what you do, and sometimes a computer just won't respond, no matter what you do.

Let's assume a system running Apache 2, MySQL 5, PHP 5, and Python 2.6.5 through mod_wsgi.

Note: I'm most interested in Mac OS X here, but an answer that pertains to any UNIX system would help me out.

My Concern

Each time I have to do either one of these, especially the second, I'm very worried for a period of time that something has been broken. Some file somewhere could be corrupt — who knows which file? There are over 1,000,000 files on the computer.

I'm often using OS X, so I'll run a "Verify Disk" operation through the Disk Utility. It will report no problems, but I'm still concerned about this.

What if some configuration file somewhere got screwed up? Or even worse, what if a binary file somewhere is corrupt? Or a script file somewhere is corrupt now? What if some hardware is damaged?

What if I don't find out about it until next month, in a critical scenario, when the corruption or damage causes a catastrophe?

Or, what if valuable data is already lost?

My Hope

My hope is that these concerns and worries are unfounded. After all, after doing this many times before, nothing truly bad has happened yet. The worst is I've had to repair some MySQL tables, but I don't seem to have lost any data.

But if my worries are not unfounded, and real damage could happen in either situation 1 or 2, then my hope is that there is a way to detect it and guard against it.

My Question(s)

Could this be because modern operating systems are designed to ensure that nothing is lost in these scenarios? Could this be because modern software is designed to ensure that nothing is lost? What about modern hardware design? What measures are in place when you pull the power plug?

My question is, for both of these scenarios, what exactly can go wrong, and what steps should be taken to fix it?

I'm under the impression that one thing that can go wrong is some programs might not have flushed their data to the disk, so any highly recent data that was supposed to be written to the disk (say, a few seconds before the power pull) might be lost. But what about beyond that? And can this very issue of 5-second data loss screw up a system?

What about corruption of random files hiding somewhere in the huge forest of files on my hard drives?

What about hardware damage?

What Would Help Me Most

  1. Detailed descriptions about what goes on internally when you either kill -9 a process or pull the power on the whole system. (it seems instant, but can someone slow it down for me?)

  2. Explanations of all things that could go wrong in these scenarios, along with (rough of course) probabilities (i.e., this is very unlikely, but this is likely)…

  3. Descriptions of measures in place in modern hardware, operating systems, and software, to prevent damage or corruption when these scenarios occur. (to comfort me)

  4. Instructions for what to do after a kill -9 or a power pull, beyond "verifying the disk", in order to truly make sure nothing is corrupt or damaged somewhere on the drive.

  5. Measures that can be taken to fortify a computer setup so that if something has to be killed or the power has to be pulled, any potential damage is mitigated.

  6. Some information about binary files — isn't it true that the apache binary file or some library could have a random byte or two corrupted in the middle, that wouldn't come out and cause a problem until later? How can I assure myself that this didn't happen as a result of the power pull or the kill?

Thanks so much!

Best Answer

Pulling the power causes everything to stop in flight, with no warning. kill -9 has the same effect on a single process, forcefully terminating it with a SIGKILL.

If a process is killed by the kernel or by a power outage, it doesn't do any clean-up. That means you could have half-written files, inconsistent states, or lost caches. You usually don't have to worry about any of this because of journaling, exit-status checks, and battery backup.

Temporary files in /tmp will be gone automatically after a power loss if /tmp is on tmpfs (which lives in memory), but you may still have application-specific lock files lying around to remove, like the lock and .parentlock for Firefox.
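For lock files that record the owner's PID, a leftover lock can be detected by checking whether that process still exists. This is only a sketch under that assumption: many applications (Firefox included) use their own lock formats, and the helper names here are hypothetical.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the target
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_if_stale(lock_path):
    """Delete lock_path if the PID stored in it is no longer running.

    Assumes the lock file contains just a PID as text. Returns True if the
    file was removed, False if it was left alone.
    """
    try:
        with open(lock_path) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # missing, or not in the assumed format: do not touch it
    if pid_is_alive(pid):
        return False
    os.remove(lock_path)
    return True
```

Note that PIDs get reused, so after a reboot an unrelated process may hold the recorded PID. A real tool would want a second check (process name, start time) before trusting the result.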

Most software is smart enough to retry a transaction if it doesn't record a successful exit status. A good example of this is a typical mail system. If a message is being delivered, but gets cut off in the middle, the sender will retry later until it gets a success.
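The retry idea can be sketched in a few lines. This is a toy model, not how any particular mail server works: deliver_with_retry and its parameters are invented for illustration, and send stands in for whatever actually performs the delivery and reports success.

```python
import time


def deliver_with_retry(send, message, attempts=5, delay=0.0):
    """Call send(message) until it reports success or attempts run out.

    `send` must return True on confirmed success. A False return or a
    ConnectionError counts as "no success recorded", so we try again.
    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, attempts + 1):
        try:
            if send(message):
                return attempt
        except ConnectionError:
            pass  # cut off mid-delivery: same as any other failed attempt
        time.sleep(delay)
    raise RuntimeError("gave up after %d attempts" % attempts)
```

The point is that the sender only stops once it has seen an explicit success. If the receiving side is killed halfway through, no success is recorded and the message simply goes out again later.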

Your filesystem is probably journaled. If you are moving or writing a file and it dies mid-stream, the journaled file system will still reference the original. The journaled filesystem will make changes non-destructively, leaving the old copy, then only reference the new copy as a last step before reclaiming space the old copies occupied on disk.
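Applications can get a similar "old copy stays valid until the last step" guarantee for their own files by writing to a temporary file and renaming it into place. The sketch below shows that application-level pattern, not the filesystem's internal journal; atomic_write is a name chosen for this example.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the contents of `path` with `data` (bytes) all-or-nothing.

    A crash before the rename leaves the old file untouched; a crash after
    it leaves the complete new file. Readers never see a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data out of the OS cache
        os.replace(tmp_path, path)  # atomic switch from old copy to new copy
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory too, so the rename itself survives a power loss.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

One caveat: fsync only goes as far as the operating system and drive allow. On OS X in particular, fsync does not force the drive to flush its own write cache; that takes the F_FULLFSYNC fcntl, which is slower.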

Now if you have a RAID array, it has all kinds of memory buffers to increase performance and provide reliability in a power failure. Most likely your filesystem will not know about the caches in the device and their state, so it thinks a change has been committed to disk, but it is still in the RAID cache somewhere. So what happens when the power dies? Hopefully you have a functional battery in your RAID enclosure and you monitor it. Otherwise you have a corrupt file system to fsck.

Yes, a few bits can become corrupted in a binary, but I would not worry about that much on modern hardware. If you are really paranoid, you can monitor the health of your disks and RAID with the appropriate tools, but you should be doing that anyway. Do regular backups and get an Uninterruptible Power Supply.
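If you want more assurance about binaries and scripts than a disk verify gives you, one option is to record checksums while the system is known to be good and compare them after an unclean shutdown. A minimal sketch follows; the function names are made up for this example, and a real setup would store the manifest somewhere safe, such as with your backups.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each regular file under `root` (relative path) to its digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def changed_files(old, new):
    """Return sorted paths that were added, removed, or differ in content."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Usage would be along the lines of saving build_manifest("/usr/local/bin") once, then calling changed_files(saved, build_manifest("/usr/local/bin")) after the incident. Files that changed for legitimate reasons (software updates) will show up too, so the list needs a human look.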
