C# – Is it possible to implement a well-distributed hash table without using the % operator?

I'm looking to implement a fast, well-distributed hash table in C#. I'm having trouble choosing my hash-constraining function that takes an arbitrary hash code and "constrains" it so it can be used to index the buckets. There are two options that I see so far:

On one hand, you can make sure the table always has a prime number of buckets, and constrain the hash by taking it modulo the number of buckets. This is, in fact, what .NET's Dictionary does. The problem with this approach is that % is very slow compared to other operations; if you look at the Agner Fog instruction tables, idiv (the instruction that gets generated for %) has a latency of ~25 cycles on newer Intel processors. Compare this to around 3 for mul, or 1 for bitwise ops like and, or, or xor.
On the other hand, you can have the number of buckets always be a power of 2. You still have to reduce the hash so you don't index outside the array, but this time it is much less expensive. Since for a power-of-two N, x % N is just x & (N - 1) (for non-negative x), the constraining is reduced to a masking operation which only takes 1-2 cycles. This is done by Google's sparsehash. The downside is that we are counting on users to provide good hashes; masking cuts off the upper part of the hash, so we are no longer taking all of its bits into account. If the user's hash is unevenly distributed, for example only the higher bits vary or the lower bits are consistently the same, then this approach has a much higher rate of collisions.
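
To make the two options concrete, here is a minimal sketch of both constraining functions. The class and method names are made up for illustration, and the cast to uint is one way (an assumption, not necessarily what Dictionary does internally) to keep negative hash codes from producing a negative index.

```csharp
using System;

static class BucketIndex
{
    // Option 1: prime-sized table, reduce with %.
    // Casting to uint first means a negative hash code cannot
    // produce a negative index.
    public static int ByModulo(int hash, int primeBucketCount)
    {
        return (int)((uint)hash % (uint)primeBucketCount);
    }

    // Option 2: power-of-two table, reduce with a mask.
    // Only the low log2(bucketCount) bits of the hash are used;
    // everything above them is discarded.
    public static int ByMask(int hash, int powerOfTwoBucketCount)
    {
        return hash & (powerOfTwoBucketCount - 1);
    }
}
```

With 16 buckets, the hashes 16, 32 and 48 all mask to bucket 0, because they differ only in bits above the mask. With 17 buckets and %, the same hashes land in buckets 16, 15 and 14.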

I am looking for an algorithm I can use that has the best of both worlds: it takes all bits of the hash into account, and is also faster than using %. It does not necessarily have to be a modulus, just something that is guaranteed to be in the range 0..N-1 (where N is the length of the buckets) and has even distribution for all slots. Does such an algorithm exist?

Thanks for helping.

Best Answer

Many modern hash table implementations do not use the modulo operation. They often use power-of-two sized tables and chop off the unneeded bits. An ideal hash function allows this, because every bit of its output is equally well distributed. The use of modulo combined with prime table sizes arose in the days when hash functions were generally poor, as they often are in .NET development. I recommend reading about SipHash, a modern hash function, then reading about some other modern functions, such as xxHash.
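
If the incoming hash cannot be trusted to be well distributed, one common workaround is to run it through a cheap mixing step before masking. The sketch below is only an illustration: it uses the 32-bit finalizer from MurmurHash3 (fmix32) as the mixer, which is my choice and not something the functions above prescribe, and the class and method names are hypothetical.

```csharp
static class MixedMask
{
    // 32-bit finalizer from MurmurHash3 (fmix32). It is a bijection on
    // uint, and each input bit influences many output bits.
    public static uint Mix(uint h)
    {
        unchecked
        {
            h ^= h >> 16;
            h *= 0x85ebca6bu;
            h ^= h >> 13;
            h *= 0xc2b2ae35u;
            h ^= h >> 16;
            return h;
        }
    }

    // Mix first, then keep the low bits. bucketCount must be a power of two.
    public static int Index(int hash, int powerOfTwoBucketCount)
    {
        uint mask = (uint)(powerOfTwoBucketCount - 1);
        return (int)(Mix(unchecked((uint)hash)) & mask);
    }
}
```

This costs two multiplies and a few shifts and xors, which on the latency figures quoted in the question is still cheaper than a single idiv.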

I should explain why .NET hash functions are often poor. In .NET, programmers are often forced to implement hash functions by overriding GetHashCode. But .NET does not provide the tools needed to ensure the programmer-created functions are high quality, namely:

encapsulation of the hash state in a structure or class
hash "add" functions, which add new data to the hash state (add a byte array, or a double, for example)
a hash "finalize" function, to produce the avalanche
encapsulation of the hash result -- in .NET you get one choice, a 32-bit signed integer.
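
As a rough picture of what such tooling could look like, here is a small hypothetical HashState struct with "add" methods and a "finish" step. Everything in it is an assumption made for illustration: the name, the FNV-1a style per-byte update, and the MurmurHash3 fmix32 finalizer are not from any .NET API. (Newer .NET versions do ship a System.HashCode type with a broadly similar Add / ToHashCode shape; this sketch is independent of it.)

```csharp
using System;

// Hypothetical helper, not part of the .NET base class library.
struct HashState
{
    private uint _state;

    public HashState(uint seed) { _state = seed; }

    // "Add" functions: fold new data into the state one byte at a time,
    // using an FNV-1a style xor-then-multiply step.
    public void Add(byte value)
    {
        unchecked { _state = (_state ^ value) * 16777619u; }
    }

    public void Add(byte[] data)
    {
        foreach (byte b in data) Add(b);
    }

    public void Add(int value)
    {
        for (int shift = 0; shift < 32; shift += 8)
            Add(unchecked((byte)(value >> shift)));
    }

    public void Add(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        for (int shift = 0; shift < 64; shift += 8)
            Add(unchecked((byte)(bits >> shift)));
    }

    // "Finalize" function: avalanche the state (MurmurHash3 fmix32)
    // so that every bit of the result depends on the whole input.
    public int Finish()
    {
        unchecked
        {
            uint h = _state;
            h ^= h >> 16;
            h *= 0x85ebca6bu;
            h ^= h >> 13;
            h *= 0xc2b2ae35u;
            h ^= h >> 16;
            return (int)h;
        }
    }
}
```

A GetHashCode override could then create a HashState in a local variable, call Add once per field, and return Finish(). Because it is a mutable struct, it should be kept in a local rather than a readonly field.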

For more information about using a hash function result as a hash table index, please see the definitions of universal forms of hashing in this paper: Faster 64-bit universal hashing using carry-less multiplications
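
For a table whose size is neither prime nor a power of two, one technique that avoids % is a multiply-shift range reduction: treat the 32-bit hash as a fraction of 2^32 and scale it by the bucket count. The sketch below is a minimal illustration with hypothetical names, and is not copied from the paper above.

```csharp
static class RangeReduce
{
    // Maps a 32-bit hash to [0, bucketCount) with one 64-bit multiply and
    // one shift: floor(hash * bucketCount / 2^32).
    public static int Index(uint hash, uint bucketCount)
    {
        return (int)(((ulong)hash * bucketCount) >> 32);
    }
}
```

The result depends mostly on the high bits of the hash, so this only distributes well if the hash itself is well mixed; hashes that differ only in their low bits will tend to land in the same bucket.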
