Algorithms – Hash Function with Uniqueness Guarantees and Controllable Entropy

algorithms cryptography hashing

Is there a class of hash functions that meets the following specs:

  • Upper and lower bound can be specified
  • Uniqueness is guaranteed as long as the input is between the upper and lower bounds
  • Amount of entropy is controllable, or at least high and evenly distributed

An example of a low entropy hash function that produces unique results, and allows the upper bound to be specified is

int hash(int x, int upperBound) {
    // Integer-divide, multiply back, subtract: equivalent to x % upperBound.
    return x - (upperBound * (x / upperBound));
}

This would produce a number between [0, upperBound), resetting back to 0 when the number can be divided by upperBound.

So let's say our upper bound is 36^3, which gives us 46656 numbers. Feeding in a number between 0 and 46655 should produce a unique result. Any number above that will produce a collision. Providing the same number should always give the same result. Being able to control entropy would be a plus, but if it's evenly distributed and high then that will work fine too.

The end goal is to turn the number into an alphanumeric representation which can quickly be looked at to determine if it has changed since the last time a number was requested. I should not receive the same number until all numbers have been used.

Best Answer

OK, so I feel I've done enough research to at least begin to move things in a helpful direction.

The term for what you are looking for generally is a "perfect hash function", with the addition that you want some potentially manageable degree of randomness.

In general the state of the art is that you use an algorithmic method that generates a mapping, and then save that final mapping. There are lots of interesting ways to go about this (and more), but the problem is that any random-looking method is all but certain to produce a collision long before you exhaust your input space (this is the birthday problem). I'll provide an example of this now, in C# code (copy/paste-able code, I believe):

  // Distinct truncated hashes; the set silently drops duplicates.
  System.Collections.Generic.HashSet<string> results = new System.Collections.Generic.HashSet<string>();

  using (System.Security.Cryptography.SHA512 hash = System.Security.Cryptography.SHA512.Create()) {
    System.Text.Encoding enc = System.Text.Encoding.UTF8;

    for (int input = 0; input < 10000; input++) {
      byte[] result = hash.ComputeHash(enc.GetBytes(input.ToString()));

      // Hex-encode this hash only, using a fresh builder per input.
      System.Text.StringBuilder Sb = new System.Text.StringBuilder();
      foreach (byte b in result) {
        Sb.Append(b.ToString("x2"));
      }

      // Keep only the last 3 hex digits as the shortened output.
      results.Add(Sb.ToString().Substring(Sb.Length - 3));
    }

  }

What I'm doing is giving an input value of 0-9999, converting it to a SHA512 hash, and then only taking the last 3 hexadecimal digits. You can then compare the size of the HashSet to the number of inputs to determine just how many duplicates you have.

The result: with 10000 inputs you only get 3735 unique results. Ouch - that's a lot of collisions! (Three hex digits can only take 16^3 = 4096 values, so most of those collisions are unavoidable.) If you change your request to mapping to 4 digits (change the above final line of code to Sb.Length - 4), you get 9303 - not bad! If you allow 5 digits of output you get 9955 - still with collisions, even though we've massively extended our allowable output (16^5 = 1048576 values) and only allow 10000 inputs!

So if you used such a method you would need to restrict your maximum inputs considerably and likely increase the size of the output too.
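These counts are about what the birthday problem predicts. If n inputs are mapped uniformly at random into m possible outputs, the expected number of distinct outputs is m * (1 - (1 - 1/m)^n). Here is a minimal sketch of that estimate. The class and method names are made up for illustration, and it assumes the truncated SHA512 digits behave like uniformly random values.

```csharp
using System;

// Hypothetical helper, not part of the original answer.
public static class CollisionEstimate
{
    // Expected number of distinct outputs when n inputs are mapped
    // uniformly at random into m possible outputs: m * (1 - (1 - 1/m)^n).
    public static double ExpectedDistinct(int n, long m)
    {
        return m * (1.0 - Math.Pow(1.0 - 1.0 / m, n));
    }

    // Number of values representable with the given count of hex digits.
    public static long HexOutputs(int digits)
    {
        return 1L << (4 * digits);
    }
}
```

For 10000 inputs, CollisionEstimate.ExpectedDistinct(10000, CollisionEstimate.HexOutputs(3)) comes out at roughly 3740, and the 4- and 5-digit cases at roughly 9275 and 9952. That is close to the measured 3735, 9303 and 9955.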

If you didn't care so much about randomness, you could use (x + 18) % 46656, as in replacing the last line with this:

results.Add(((input + 18) % 46656).ToString());

This results in zero collisions and runs a lot faster, too. The output is not at all random, of course, especially if you move up in sequence.

Now, with some hand-tuning I did manage to come up with this little doozy:

results.Add(((input * (40001) + 11) % 46656).ToString());

So take the input, multiply it by 40001, add 11, then wrap the result around your modulus (46656). On inputs 0-46655 it generates no collisions, and yet it jumps all over the place: f(1) -> 40012, while f(2) -> 33357. If you skip the first few inputs (0-2) the function that generates these numbers is a little more opaque to a casual observer. Keep in mind, though, that it is still an affine map modulo 46656 (a linear congruential step), so anyone who sees a few consecutive outputs can recover the multiplier and offset. The multiplier acts as a kind of "seed" to your function, but it must share no factor with the modulus or the mapping stops being collision-free. Since 46656 = 2^6 * 3^6, that means an odd number that is not a multiple of 3; any prime other than 2 or 3 qualifies. Small multipliers result in less jumpiness, while multiplying by the modulus itself sends every input to the same value (the offset, 11). I suspect somewhere in the middle/high side is ideal.
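A multiply-add-wrap mapping like this is collision-free on 0 to m - 1 exactly when the multiplier and the modulus m have no common factor. The sketch below checks that condition and counts distinct outputs by brute force. AffineMap and its method names are hypothetical, chosen only for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper illustrating the multiply-add-wrap mapping.
public static class AffineMap
{
    // Greatest common divisor (Euclid's algorithm).
    public static int Gcd(int a, int b)
    {
        while (b != 0)
        {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Maps input to (input * multiplier + offset) mod modulus.
    // long arithmetic guards against overflow for larger multipliers or moduli.
    public static int Map(int input, int multiplier, int offset, int modulus)
    {
        return (int)(((long)input * multiplier + offset) % modulus);
    }

    // Counts distinct outputs over inputs 0 .. modulus - 1.
    // The count equals modulus exactly when the mapping is collision-free.
    public static int DistinctOutputs(int multiplier, int offset, int modulus)
    {
        var seen = new HashSet<int>();
        for (int input = 0; input < modulus; input++)
        {
            seen.Add(Map(input, multiplier, offset, modulus));
        }
        return seen.Count;
    }
}
```

With these definitions, AffineMap.Gcd(40001, 46656) is 1 and AffineMap.DistinctOutputs(40001, 11, 46656) is 46656, so every input gets its own output. Swapping in 40002, which shares the factor 6 with 46656, leaves only 7776 distinct outputs.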

Then parse the integer output into your desired string format, and tadah, Bob's your uncle!
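For the string step, one possible format is a fixed-width base-36 code, since 36^3 = 46656 means three characters from 0-9 and A-Z cover the whole range exactly. The alphabet, the padding and the Base36 name are assumptions for illustration; the question does not pin down a format.

```csharp
using System;

// Hypothetical helper; the alphabet and fixed width are assumptions.
public static class Base36
{
    private const string Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Encodes a non-negative value as a fixed-width base-36 string,
    // left-padded with '0'. Width 3 covers 0 .. 36^3 - 1 = 46655.
    public static string Encode(int value, int width)
    {
        if (value < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(value));
        }
        char[] digits = new char[width];
        // Fill from the least significant digit (rightmost) to the left.
        for (int i = width - 1; i >= 0; i--)
        {
            digits[i] = Alphabet[value % 36];
            value /= 36;
        }
        if (value != 0)
        {
            throw new ArgumentOutOfRangeException(nameof(value), "Value does not fit in the requested width.");
        }
        return new string(digits);
    }
}
```

Base36.Encode(0, 3) gives "000" and Base36.Encode(46655, 3) gives "ZZZ". The fixed width keeps every code the same length, which makes two codes easy to compare at a glance.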

Now if you want something more random with measurable entropy and that takes a very long time to reverse-engineer...well, that's going to be harder to arrange by far.
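One way to move in that direction, following the earlier remark about generating a mapping and then saving it, is to shuffle the whole range once and store the result as a lookup table. The sketch below is an illustration rather than a recommendation. ShuffledTable is a made-up name, and it assumes .NET Core 3.0 or later, where RandomNumberGenerator.GetInt32 is available.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical sketch: a stored random permutation of 0 .. size - 1.
public static class ShuffledTable
{
    // Builds the table with a Fisher-Yates shuffle driven by the
    // platform's cryptographic random number generator.
    public static int[] Create(int size)
    {
        int[] table = new int[size];
        for (int i = 0; i < size; i++)
        {
            table[i] = i;
        }
        for (int i = size - 1; i > 0; i--)
        {
            // Pick j uniformly from 0 .. i (the upper bound is exclusive).
            int j = RandomNumberGenerator.GetInt32(i + 1);
            int tmp = table[i];
            table[i] = table[j];
            table[j] = tmp;
        }
        return table;
    }
}
```

After int[] table = ShuffledTable.Create(46656), the "hash" of x is simply table[x]. Every permutation of the range is equally likely, and there is no formula to reverse-engineer. The trade-off is that the table itself (46656 integers) has to be saved and kept private.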

I hope this is at least helpful on getting you closer to where you want to go.
