Say you have two hashes H(A) and H(B) and you want to combine them. I've read that a good way to combine two hashes is to XOR them, e.g. XOR( H(A), H(B) ).
The best explanation I've found is touched on briefly in these hash function guidelines:
XORing two numbers with roughly random distribution results in another number still with roughly random distribution*, but which now depends on the two values.
…
* At each bit of the two numbers to combine, a 0 is output if the two bits are equal, else a 1. In other words, in 50% of the combinations, a 1 will be output. So if the two input bits each have a roughly 50-50 chance of being 0 or 1, then so too will the output bit.
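That footnote can be checked by brute force. The following is a minimal sketch, not taken from the guidelines; the Op enum and fraction_of_one_bits are names made up for illustration. It combines every pair of 8-bit values and measures how often an output bit is 1:

```cpp
#include <bitset>
#include <cstdint>

// Illustrative names only; nothing here comes from the quoted guidelines.
enum class Op { Xor, And, Or };

// Combines every pair of 8-bit values with the given operation and
// returns the fraction of output bits that are 1.
double fraction_of_one_bits(Op op) {
    std::uint64_t ones = 0;
    std::uint64_t total = 0;
    for (unsigned x = 0; x < 256; ++x) {
        for (unsigned y = 0; y < 256; ++y) {
            unsigned r = (op == Op::Xor) ? (x ^ y)
                       : (op == Op::And) ? (x & y)
                                         : (x | y);
            ones += std::bitset<8>(r).count();  // set bits in this result
            total += 8;                         // bits examined so far
        }
    }
    return static_cast<double>(ones) / static_cast<double>(total);
}
```

With uniformly distributed inputs, xor keeps the output bits balanced at one half, while and skews them toward 0 (one quarter are 1) and or skews them toward 1 (three quarters are 1).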
Can you explain the intuition and/or mathematics behind why XOR should be the default operation for combining hash functions (rather than OR or AND etc.)?
Best Answer
xor is a dangerous default function to use when hashing. It is better than and and or, but that doesn't say much.
xor is symmetric, so the order of the elements is lost. So "bad" will hash combine the same as "dab".
xor maps pairwise identical values to zero, and you should avoid mapping "common" values to zero. So (a,a) gets mapped to 0, and (b,b) also gets mapped to 0. As such pairs are almost always more common than randomness might imply, you end up with far too many collisions at zero.
With these two problems, xor ends up being a hash combiner that looks half decent on the surface, but not after further inspection.
On modern hardware, adding is usually about as fast as xor (it probably uses more power to pull this off, admittedly). Adding's truth table is similar to xor on the bit in question, but it also sends a bit to the next bit over when both values are 1. This means it erases less information.
So hash(a) + hash(b) is better than hash(a) xor hash(b) in that if a==b, the result is hash(a)<<1 instead of 0.
This remains symmetric, so "bad" and "dab" getting the same result remains a problem. We can break this symmetry for a modest cost: (hash(a)<<1) + hash(a) + hash(b), aka hash(a)*3 + hash(b). (Calculating hash(a) once and storing it is advised if you use the shift solution.) Any odd constant instead of 3 will bijectively map a "k-bit" unsigned integer to itself, as arithmetic on unsigned integers is math modulo 2^k for some k, and any odd constant is relatively prime to 2^k.
For an even fancier version, we can examine boost::hash_combine, which effectively adds together some shifted versions of its first input, lhs, with a constant (which is basically random 0s and 1s; in particular it is the inverse of the golden ratio as a 32 bit fixed point fraction), using some addition and an xor. This breaks symmetry, and introduces some "noise" if the incoming hashed values are poor (i.e., imagine every component hashes to 0: boost::hash_combine handles it well, generating a smear of 1s and 0s after each combine. My naive 3*hash(a)+hash(b) simply outputs a 0 in that case).
(For those not familiar with C/C++, a size_t is an unsigned integer value which is big enough to describe the size of any object in memory. On a 64 bit system, it is usually a 64 bit unsigned integer. On a 32 bit system, a 32 bit unsigned integer.)
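The points about xor, addition and the odd multiplier can be checked with a few lines of C++. This is only a sketch: the function names and the two stand-in hash values (ha, hb) are made up for illustration, and the bijection check is done on 8-bit values so that it can be exhaustive.

```cpp
#include <bitset>
#include <cstddef>

// The three naive combiners discussed above (names are illustrative).
constexpr std::size_t combine_xor(std::size_t a, std::size_t b) { return a ^ b; }
constexpr std::size_t combine_add(std::size_t a, std::size_t b) { return a + b; }
constexpr std::size_t combine_mul3(std::size_t a, std::size_t b) { return a * 3 + b; }

// Arbitrary stand-in hash values for two distinct elements (hypothetical).
constexpr std::size_t ha = 0x1234;
constexpr std::size_t hb = 0xBEEF;

static_assert(combine_xor(ha, ha) == 0, "xor maps (a,a) to zero");
static_assert(combine_xor(ha, hb) == combine_xor(hb, ha), "xor loses order");
static_assert(combine_add(ha, ha) == (ha << 1), "add maps (a,a) to hash(a)<<1");
static_assert(combine_add(ha, hb) == combine_add(hb, ha), "add still loses order");
static_assert(combine_mul3(ha, hb) != combine_mul3(hb, ha), "3*a+b keeps order");
static_assert(combine_mul3(0, 0) == 0, "but all-zero inputs still give zero");

// Returns true if x -> x * m (mod 2^8) hits every 8-bit value.
// With 256 inputs and 256 possible outputs, that means it is a bijection.
bool is_bijective_mod_256(unsigned m) {
    std::bitset<256> seen;
    for (unsigned x = 0; x < 256; ++x) {
        seen.set((x * m) & 0xFFu);
    }
    return seen.all();
}
```

Odd multipliers such as 3 or 5 pass the bijection check, while even ones such as 2 or 6 do not, because an even multiplier can never produce an odd result modulo a power of two.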
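For reference, a combiner in the style of boost::hash_combine can be sketched as below. This follows the long-standing formulation (xor of lhs with rhs plus the constant 0x9e3779b9 plus two shifted copies of lhs); the exact mixing in a given Boost release may differ, so treat it as an approximation rather than the library's code. The hash_string helper is hypothetical and, for brevity, assumes each character's code serves as its hash value.

```cpp
#include <cstddef>
#include <string>

// Sketch of a boost::hash_combine-style mixer; not the library's actual source.
// 0x9e3779b9 is 2^32 divided by the golden ratio, i.e. the inverse of the
// golden ratio as a 32-bit fixed point fraction.
constexpr std::size_t hash_combine(std::size_t lhs, std::size_t rhs) {
    return lhs ^ (rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2));
}

// Even when both inputs are 0, the constant injects some set bits.
static_assert(hash_combine(0, 0) == 0x9e3779b9, "zero inputs do not stay zero");

// Hypothetical helper: folds a string into one hash, starting from 0 and
// using each character's code as a stand-in for its hash value.
std::size_t hash_string(const std::string& s) {
    std::size_t seed = 0;
    for (unsigned char c : s) {
        seed = hash_combine(seed, c);
    }
    return seed;
}
```

Because lhs and rhs are treated differently, the fold is order-sensitive: hash_string("bad") and hash_string("dab") come out different, which a plain xor or add over the same characters cannot achieve.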