How should I handle checksum collisions in my application?

Tags: hash, hash-collision, language-agnostic

Part of my application stores files. Because we could potentially be adding many copies of the same file, I first compute and keep a hash of each file. If two files have the same hash, we throw one out, and both "references" to that file point to the same physical file.

  1. How much should I be worried about hash collisions?

  2. In the case of a collision, what should I do? The whole crux of my code so far depends on there not being two different files with the same hash. Right now, in the event of a collision, my app would throw out a legitimately different file and point to the existing file with the same hash.

  3. Should I be using something other than MD5? Does SHA-1 have a better collision rate?

Best Answer

Unless your application is really, REALLY critical, do not worry about accidental hash collisions. They are so rare that a lot of software simply assumes they never happen, and that software would fail catastrophically if the assumption turned out to be false even once.
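To put a rough number on "so rare", here is a minimal sketch of the standard birthday-bound approximation. It assumes the hash behaves like a uniformly random function and that nobody is deliberately crafting collisions; the function name `collision_probability` and the figure of one billion files are just illustrative choices, not anything from your setup.

```python
import math


def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_files distinct files
    share a digest, assuming an ideal hash with hash_bits bits of output."""
    pairs = num_files * (num_files - 1) / 2
    # Birthday approximation: p ~= 1 - exp(-pairs / 2**hash_bits).
    # -expm1(-x) evaluates 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


if __name__ == "__main__":
    # Hypothetical workload: one billion distinct files.
    for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
        print(f"{name}: {collision_probability(10**9, bits):.3e}")
```

Even for MD5 with a billion files, the estimate comes out on the order of 1e-21. Keep in mind this only covers accidental collisions; it says nothing about an attacker who constructs colliding files on purpose.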

SHA-1 has a larger output space than MD5 (160 bits versus 128 bits), so it's definitely not a worse choice against accidental collisions. Practical collision attacks have been demonstrated against both MD5 and SHA-1, though, so if you are afraid of someone actively colliding your hashes, a later variant of SHA, such as SHA-256, would be a better idea.
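If you do want a safety net for question 2, one option is to confirm a hash match with a byte-for-byte comparison before discarding anything. The following is only a sketch under assumptions of my own: the store is a flat directory, stored files are named `<digest>-<n>`, and the helper names `file_digest` and `store` are hypothetical, not part of your code.

```python
import filecmp
import hashlib
import shutil
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files never need to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def store(src, store_dir):
    """Copy src into store_dir unless identical content is already there.

    Returns the path of the stored copy. Two files with the same digest
    but different bytes get different numeric suffixes, so neither is lost.
    """
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    digest = file_digest(src)
    suffix = 0
    while True:
        candidate = store_dir / f"{digest}-{suffix}"
        if not candidate.exists():
            shutil.copyfile(src, candidate)  # first time this content is seen
            return candidate
        if filecmp.cmp(src, candidate, shallow=False):  # compare actual bytes
            return candidate  # genuine duplicate, reuse the existing file
        suffix += 1  # same digest, different bytes: a real collision
```

The comparison costs one extra read of the file whenever the digest matches. With SHA-256 the `suffix += 1` branch should never run in practice, so whether the check is worth it depends on how critical the data is.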