Algorithms – Hash Function with Uniqueness Guarantees and Controllable Entropy

algorithms, cryptography, hashing

Is there a class of hash functions that meets the following specs:

Upper and lower bound can be specified
Uniqueness is guaranteed as long as the input is between the upper and lower bounds
Amount of entropy is controllable, or at least high and evenly distributed

An example of a low-entropy hash function that produces unique results and allows the upper bound to be specified is:

int hash(int x, int upperBound) {
    // Integer division truncates, so for non-negative x this equals x % upperBound.
    return x - (upperBound * (x / upperBound));
}

This produces a number in the range [0, upperBound), wrapping back to 0 whenever x is a multiple of upperBound.
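For non-negative inputs, the function above is the remainder operator in disguise. A minimal sketch of that equivalence, assuming C# 9 or later for top-level statements; the name HashBySubtraction and the sample bound of 8 are illustrative and not from the question:

```csharp
using System;

// The question's function, written with integer division.
static int HashBySubtraction(int x, int upperBound) =>
    x - (upperBound * (x / upperBound));

const int Bound = 8; // illustrative bound, chosen only to keep the output short

for (int x = 0; x < 20; x++)
{
    // For non-negative x both columns agree, and the value wraps
    // back to 0 at every multiple of Bound.
    Console.WriteLine($"{x} -> {HashBySubtraction(x, Bound)} (x % Bound = {x % Bound})");
}
```

Every input in [0, Bound) maps to itself, which is why the result is unique but carries no scrambling at all.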

So let's say our upper bound is 36^3, which gives us 46656 numbers. Feeding in a number between 0 and 46655 should produce a unique result. Any number above that will produce a collision. Providing the same number should always give the same result. Being able to control entropy would be a plus, but if it's evenly distributed and high then that will work fine too.

The end goal is to turn the number into an alphanumeric representation that can be checked at a glance to see whether it has changed since the last time a number was requested. I should not receive the same number again until all numbers have been used.

Best Answer

OK, so I feel I've done enough research to at least begin to move things in a helpful direction.

The term for what you are looking for generally is a "perfect hash function", with the addition that you want some potentially manageable degree of randomness.

In general the state of the art is that you use an algorithmic method that generates a mapping, and then save that final mapping. There are lots of interesting ways to go about this (and more), but the problem is that with any random method a collision is all but certain well before you exhaust your input space. I'll provide an example of this now, in C# code (copy/paste-able code, I believe):

  // Collects the distinct truncated hashes; duplicates are dropped by the set.
  System.Collections.Generic.HashSet<string> results = new System.Collections.Generic.HashSet<string>();

  using (System.Security.Cryptography.SHA512 hash = System.Security.Cryptography.SHA512.Create()) {
    System.Text.Encoding enc = System.Text.Encoding.UTF8;

    for (int input = 0; input < 10000; input++) {
      byte[] result = hash.ComputeHash(enc.GetBytes(input.ToString()));

      // A fresh builder per input, so the hex string does not grow across iterations.
      System.Text.StringBuilder Sb = new System.Text.StringBuilder();
      foreach (byte b in result) {
        Sb.Append(b.ToString("x2"));
      }

      // Keep only the last 3 hex digits as the "hash".
      results.Add(Sb.ToString().Substring(Sb.Length - 3));
    }

  }

What I'm doing is taking each input value from 0 to 9999, converting it to a SHA512 hash, and then keeping only the last 3 hexadecimal digits. You can then compare the size of the HashSet to the number of inputs to determine just how many duplicates you have.

The result: with 10000 inputs you only get 3735 unique results. Ouch - that's a lot of collisions! Some are unavoidable, since 3 hex digits allow only 16^3 = 4096 distinct outputs. If you change your request to mapping to 4 digits (change Sb.Length - 3 in the final line of the loop to Sb.Length - 4), you get 9303 - not bad! If you allow 5 digits of output you get 9955 - still with collisions, even though the output space has grown to 16^5 = 1048576 values for only 10000 inputs!

So if you used such a method you would need to restrict your maximum number of inputs considerably and likely increase the size of the output too.
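These counts are roughly what a birthday-style estimate predicts. As a rough sketch, assuming the truncated SHA512 digits behave like uniform random draws (an assumption, not something measured here), the expected number of distinct outputs for n inputs and m possible outputs is m * (1 - (1 - 1/m)^n). ExpectedDistinct is a hypothetical helper name, and C# 9 top-level statements are assumed:

```csharp
using System;

// Expected number of distinct outputs when n inputs are mapped uniformly
// at random into m possible outputs: m * (1 - (1 - 1/m)^n).
static double ExpectedDistinct(double n, double m) =>
    m * (1.0 - Math.Pow(1.0 - 1.0 / m, n));

const int Inputs = 10000;

foreach (int hexDigits in new[] { 3, 4, 5 })
{
    double outputs = Math.Pow(16, hexDigits); // 16 possible values per hex digit
    Console.WriteLine(
        $"{hexDigits} hex digits: {outputs} possible outputs, " +
        $"about {ExpectedDistinct(Inputs, outputs):F0} distinct expected");
}
```

The estimates should land close to the measured 3735, 9303 and 9955, which suggests the collisions come from the size of the output space rather than from any weakness in SHA512.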

If you didn't care so much about randomness, you could use (x + 18) % 46656, by replacing the results.Add line with this:

results.Add(((input + 18) % 46656).ToString());

This results in zero collisions and runs a lot faster, too. The output is not at all random, of course, especially if you move up in sequence.

Now, with some hand-tuning I did manage to come up with this little doozy:

results.Add(((input * (40001) + 11) % 46656).ToString());

So take the input, multiply it by 40001, add 11, then wrap it around with the modulus of your output size. On 0-46655 as input it generates no collisions, and yet it jumps all over the place: f(1) -> 40012, while f(2) -> 33357. If you skip the first few inputs (0-2) the function that generates these numbers is a little more opaque, although it is still an affine (linear congruential) map, so anyone who sees the outputs for consecutive inputs can recover the multiplier from their difference. The multiplier acts as a kind of "seed" to your function. It must share no factor with 46656 = 2^6 * 3^6, so being odd is not enough (multiples of 3 also cause collisions); any prime other than 2 or 3 works. Small numbers result in less jumpiness, while multiplying by the same number you are taking the mod of contributes...well, always 0, so every input maps to the same value. I suspect somewhere in the middle/high side is ideal.
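The no-collision claim can be checked directly. The map x -> (a*x + c) mod m is one-to-one on 0..m-1 exactly when a and m share no common factor, so a quick sketch can test both the gcd and the actual output count. Gcd and Scramble are hypothetical helper names, and C# 9 top-level statements are assumed:

```csharp
using System;
using System.Collections.Generic;

const int Modulus = 46656;    // 36^3, the size of the output space
const int Multiplier = 40001; // the hand-tuned multiplier from the answer
const int Offset = 11;        // the additive constant from the answer

static int Gcd(int a, int b) => b == 0 ? a : Gcd(b, a % b);

// long arithmetic, so that trying larger multipliers cannot overflow int.
static int Scramble(int input, int multiplier, int offset, int modulus) =>
    (int)(((long)input * multiplier + offset) % modulus);

// A gcd of 1 means the multiplier is usable with this modulus.
Console.WriteLine($"gcd({Multiplier}, {Modulus}) = {Gcd(Multiplier, Modulus)}");

// Brute-force confirmation: count the distinct outputs over the full input range.
var seen = new HashSet<int>();
for (int input = 0; input < Modulus; input++)
{
    seen.Add(Scramble(input, Multiplier, Offset, Modulus));
}
Console.WriteLine($"{seen.Count} distinct outputs for {Modulus} inputs");
```

Swapping in a multiplier such as 3 or 40002 makes the gcd larger than 1, and the distinct count drops accordingly.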

Then parse the integer output into your desired string format, and tadah, Bob's your uncle!
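One way to do that last formatting step, sketched under the assumption that the target format is three lowercase base-36 characters (the question only says "alphanumeric", so the alphabet and width are guesses). ToBase36 is a hypothetical helper name, and C# 9 top-level statements are assumed:

```csharp
using System;

// Converts a value in 0..36^length-1 to a fixed-width base-36 string.
static string ToBase36(int value, int length)
{
    const string alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"; // 36 symbols
    var chars = new char[length];
    for (int i = length - 1; i >= 0; i--)
    {
        chars[i] = alphabet[value % 36]; // least significant digit goes last
        value /= 36;
    }
    return new string(chars);
}

foreach (int input in new[] { 0, 1, 2, 46655 })
{
    // Same scrambling step as the answer: (input * 40001 + 11) % 46656.
    int scrambled = (int)(((long)input * 40001 + 11) % 46656);
    Console.WriteLine($"{input} -> {scrambled} -> {ToBase36(scrambled, 3)}");
}
```

Because 36^3 = 46656, every scrambled value fits in exactly three characters, and distinct values always give distinct strings.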

Now if you want something more random with measurable entropy and that takes a very long time to reverse-engineer...well, that's going to be harder to arrange by far.

I hope this is at least helpful on getting you closer to where you want to go.

Related Solutions

Password Security – Is It More Secure to Hash a Password Multiple Times?

This is better suited to security.stackexchange but...

The problem with

hash1(hash2(hash3(...hashn(pass+salt)+salt)+salt)...)+salt)

is that this is only as strong as the weakest hash function in the chain. For example if hashn (the innermost hash) gives a collision, the entire hash chain will give a collision (irrespective of what other hashes are in the chain).

A stronger chain would be

hash1(hash2(hash3(...hashn(pass + salt) + pass + salt) + pass + salt)...) + pass + salt)

Here we avoid the early collision problem and we essentially generate a salt that depends on the password for the final hash.

And if one step in the chain collides it doesn't matter because in the next step the password is used again and should give a different result for different passwords.
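The second construction can be sketched as follows. This only illustrates the chaining order and is not a vetted password-hashing scheme: SHA-256 and SHA-512 stand in for the unnamed hash1..hashn, ChainedHash and the sample salt are hypothetical, and C# 9 top-level statements are assumed:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Applies the stages innermost-first. Every stage hashes
// (previous digest + password + salt), so the password re-enters each step.
static byte[] ChainedHash(string password, byte[] salt, params HashAlgorithm[] stages)
{
    byte[] secret = Encoding.UTF8.GetBytes(password);
    byte[] digest = Array.Empty<byte>(); // nothing to carry into the innermost stage
    foreach (HashAlgorithm stage in stages)
    {
        byte[] buffer = new byte[digest.Length + secret.Length + salt.Length];
        Buffer.BlockCopy(digest, 0, buffer, 0, digest.Length);
        Buffer.BlockCopy(secret, 0, buffer, digest.Length, secret.Length);
        Buffer.BlockCopy(salt, 0, buffer, digest.Length + secret.Length, salt.Length);
        digest = stage.ComputeHash(buffer);
    }
    return digest;
}

using SHA256 inner = SHA256.Create();
using SHA512 outer = SHA512.Create();
byte[] demoSalt = Encoding.UTF8.GetBytes("example-salt"); // placeholder; a real salt would be random per user
byte[] result = ChainedHash("correct horse", demoSalt, inner, outer);
Console.WriteLine(BitConverter.ToString(result));
```

Even if two passwords collided in the inner stage, the outer stage would still see different inputs, because each password is appended again before hashing.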

Unit Testing – Testing Random/Non-Deterministic Algorithms

What you actually want to test here, I assume, is that given a specific set of results from the randomiser, the rest of your method performs correctly.

If that's what you're looking for then mock out the randomiser, to make it deterministic within the realms of the test.

I generally have mock objects for all kinds of non-deterministic or unpredictable (at the time of writing the test) data, including GUID generators and DateTime.Now.

Edit, from comments: You have to mock the PRNG (that term escaped me last night) at the lowest level possible, i.e. when it generates the array of bytes, not after you turn those into Int64s. Or even at both levels, so you can test that your conversion to an array of Int64 works as intended and then test separately that your conversion to an array of DateTimes works as intended. As Jonathon said, you could just do that by giving it a set seed, or you can give it the array of bytes to return.

I prefer the latter because it won't break if the framework implementation of a PRNG changes. However, one advantage to giving it the seed is that if you find a case in production that didn't work as intended, you only need to have logged one number to be able to replicate it, as opposed to the whole array.
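Both options described above can be sketched with a byte source that is passed in as a delegate. This is a minimal illustration under assumptions: NextInt64s stands in for whatever conversion code is under test, the canned bytes and the seed 12345 are arbitrary, and C# 9 top-level statements are assumed:

```csharp
using System;
using System.Buffers.Binary;

// Code under test: turns raw bytes from some source into Int64 values.
// The byte source is a parameter, so a test can substitute a fixed one.
static long[] NextInt64s(Func<int, byte[]> nextBytes, int count)
{
    byte[] raw = nextBytes(count * sizeof(long));
    var values = new long[count];
    for (int i = 0; i < count; i++)
    {
        // Fixed byte order, so the result does not depend on the machine.
        values[i] = BinaryPrimitives.ReadInt64LittleEndian(raw.AsSpan(i * sizeof(long), sizeof(long)));
    }
    return values;
}

// Option 1: a fake source that ignores the requested length and returns canned bytes.
byte[] canned = { 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0 };
long[] fromFake = NextInt64s(_ => canned, 2);
Console.WriteLine(string.Join(", ", fromFake));

// Option 2: a seeded System.Random, reproducible from one logged number.
var seeded = new Random(12345);
long[] fromSeed = NextInt64s(n => { var b = new byte[n]; seeded.NextBytes(b); return b; }, 2);
Console.WriteLine(string.Join(", ", fromSeed));
```

The canned-bytes version keeps passing even if the framework's PRNG implementation changes, while the seeded version only needs the seed to be logged to replay a failure.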

All this said, you must remember that it's called a Pseudo Random Number Generator for a reason. There may be some bias even at that level.
