N “anti-hash” or “similarity hash” / similarity measure algorithm

algorithmsdifferencehashing

The idea of hashes is they get drastically different results for even the smallest change of data.

What I am asking for is the opposite of that. An algorithm that will produce proximity hash values for data that is approximately the same. Regardless of where the difference is, to only measure the extent of the difference, and the resulting hash value is closer or further from the original hash value depending on whether the second data set is more or less similar to the original.

I do not target any particular data type, raw byte arrays would be fine.

Best Answer

The idea of hashes is they get drastically different results for even the smallest change of data.

No, it is not.

The idea of hashes is that they map a larger, potentially infinite input space to a smaller, finite, usually fixed output space. E.g. SHA-3 maps infinitely many octet strings into 2512 bitstrings.

What you are talking about is one of the properties of a cryptographically secure message digest, which is a (very small) special case of a hash function.

For example, the hash functions (aka fingerprints) that are used to detect copyright violations on online video platforms would be useless if they had this property.

One of the most well-known hash functions that has the opposite property, namely that similar inputs generate similar outputs, is soundex: an algorithm that produces similar hash values for words that sound similar.

What I am asking for is the opposite of that. An algorithm that will produce proximity hash values for data that is approximately the same. Regardless of where the difference is, to only measure the extent of the difference, and the resulting hash value is closer or further from the original hash value depending on whether the second data set is more or less similar to the original.

This sounds more like a similarity measure than a hash function. In particular, there is nothing in your description that implies a small, fixed-size, finite output space.

Related Topic