Python – How to perform consistent hashing on any Python object that works with hash()

caching, hashing, python

I have a Python library that performs a kind of calculation given a parameter-object. A requirement of the parameter object is that it be both hashable and serializable. It's a long calculation, so it memoizes itself. Let's say this is implemented something like this:

# Notes: params *must* be serializable and hashable;
#        equal params *always* produce equal results.
def long_calc(params):
   if params in long_calc._cache:
      return long_calc._cache[params]
   result = _long_calc_but_with_math(params)
   long_calc._cache[params] = result
   return result
long_calc._cache = {}

In particular, this calculation is really slow, so I would like to add an optional feature to the function: if you provide it with a cache location, it checks for a cache file and, if the file exists, loads and returns it instead of running the calculation; if the file is not found, it calculates the result and saves it to that file before returning. That would look something like this:

import os

def long_calc(params, cache=None):
   if params in long_calc._cache:
      return long_calc._cache[params]
   if cache is not None and os.path.isfile(cache):
      return _load_calc_result(cache)
   result = _long_calc_but_with_math(params)
   long_calc._cache[params] = result
   if cache is not None:
      _save_calc_result(cache, result)
   return result
long_calc._cache = {}

In typical use, this calculation will be run many times with many sets of parameters, and a given Python instance will likely examine many results. Accordingly, I would like to make this easier for the user and allow them to provide just a cache directory; the long_calc function would then be responsible for ensuring that the directory's contents represent a correct/unique mapping from parameters to cached results. My first idea for how to do this was to use the parameter hashes as directory names; each directory represents the set of parameter values that share that hash and thus collide in this scheme. The parameters themselves would be serialized out to a file key_0.pkl and the result to val_0.pkl (and so on for each collision), so that the exact parameters could be checked and collisions resolved. I realize this is slow, but the calculations are plenty long enough to justify the wait. For context, this is scientific software, so security is not really a concern, but reproducibility is. I wrote this approach up something like this:

def long_calc(params, cache=None):
   if params in long_calc._cache:
      return long_calc._cache[params]
   if cache is not None:
      cache = _cache_filename(params, cache)
      if os.path.isfile(cache):
         return _load_calc_result(cache)
   result = _long_calc_but_with_math(params)
   long_calc._cache[params] = result
   if cache is not None:
      _save_calc_result(cache, result)
   return result
long_calc._cache = {}

def _cache_filename(params, cache_dir):
   # Get a hash for the params:
   h = hash(params)
   # Turn it into a directory name
   hstr = ('p' if h >= 0 else 'n') + str(abs(h))
   hdir = os.path.join(cache_dir, hstr)
   # make it if it doesn't exist (exist_ok so repeat calls don't fail)
   os.makedirs(hdir, mode=0o755, exist_ok=True)
   # look for a key that either matches or is not yet filled
   k = 0
   while True:
      # get the keyfile's name
      kfilename = os.path.join(hdir, 'key_%d.pkl' % k)
      if not os.path.isfile(kfilename):
         # Claim this spot
         _save_params(kfilename, params)
         break
      v = _load_params(kfilename)
      if params == v:
         break
      k = k + 1
   # return the value file that matches the key file 
   return os.path.join(hdir, 'val_%d.pkl' % k)

For the record I know there's a race condition here and am not worried about it.

This approach worked just fine and tested just fine… until I restarted my Python instance. It turns out that Python's hash() is intentionally salted per process (at least for str and bytes objects) as a security measure, so the line h = hash(params) isn't producing a consistent hash across runs, which is what I need. This puts me in a bit of a bind, because previous versions of the software have already established the norm that the params need to be hashable (i.e. via the hash() function) to be valid. Changing this requirement so that the params object must instead be hashable by some other library's scheme will break code unless that other library has a drop-in replacement for hash().
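
The effect is easy to demonstrate. The snippet below (illustration only) spawns two fresh interpreters and almost always prints two different numbers for the same string:

import subprocess, sys

# hash() of str/bytes is salted per interpreter process (PYTHONHASHSEED),
# so two fresh interpreters almost always report different values for the
# same literal.
cmd = [sys.executable, '-c', "print(hash('some params'))"]
print(subprocess.check_output(cmd, text=True).strip())
print(subprocess.check_output(cmd, text=True).strip())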

TLDR; Question 1: Is there a drop-in replacement for hash(x) that yields a consistent hash of x for any (or almost any) x normally hashable by hash(x)? Note that I'm willing to sacrifice a few unusual edge cases regarding the type of x if there's a drop-in for this that is pretty close.

Other answers about this kind of question have pointed at hashlib, but it looks to me like, with that library, I would need to convert objects like frozensets into a unique string of bytes for hashing, which means it isn't a clear drop-in replacement for hash(). (I can't guarantee that equal parameter objects will have identical serialization strings, only that, once deserialized, the objects will again compare equal.)
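
To illustrate what I mean, the sketch below is roughly the kind of conversion I'd have to write and maintain by hand (names like _stable_hash are made up, and only a few builtin types are covered):

import hashlib

def _stable_hash(x):
   # Hypothetical sketch: recursively reduce a value to bytes, then digest it.
   if isinstance(x, bytes):
      data = b'bytes:' + x
   elif isinstance(x, str):
      data = b'str:' + x.encode('utf-8')
   elif isinstance(x, (int, float, bool, type(None))):
      data = repr(x).encode('utf-8')
   elif isinstance(x, tuple):
      data = b'tuple:' + b','.join(_stable_hash(el).encode() for el in x)
   elif isinstance(x, frozenset):
      # order-independent: hash the elements, then sort their digests
      data = b'set:' + b','.join(sorted(_stable_hash(el).encode() for el in x))
   else:
      raise TypeError('no stable-hash rule for %r' % type(x))
   return hashlib.sha256(data).hexdigest()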

TLDR; Question 2: Is there an easy way to make an object that is hashable (via hash()) and serializable (into a byte-string that can be different from that of another equal object) work with existing hash-libraries like hashlib?

A bit of research shows that you can pass Python the environment variable PYTHONHASHSEED=0 to disable the salting. This is basically what I want, but for various reasons I don't think it's a good idea to force/require my users to set this themselves, and, as far as I can tell, you can't change the hash seed during a running Python process. This has led me to the uncomfortable conclusion that the best way to attack this problem may be to fork/exec and have the child process manage the calculation/caching of the result under a fixed seed. I think that this is a pretty bad solution, and it's hard for me to imagine that the consistent hash algorithm buried in Python can't be accessed in any other way than this.
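
For concreteness, the child-process workaround I'm imagining would look roughly like the sketch below (just the hashing step; _hash_in_subprocess is a name I'm making up):

import os, pickle, subprocess, sys

def _hash_in_subprocess(params):
   # Spawn a fresh interpreter with a fixed hash seed and have it report
   # hash() of the unpickled params back to us.
   env = dict(os.environ, PYTHONHASHSEED='0')
   code = ('import sys, pickle; '
           'print(hash(pickle.loads(sys.stdin.buffer.read())))')
   proc = subprocess.run([sys.executable, '-c', code],
                         input=pickle.dumps(params),
                         env=env, capture_output=True, check=True)
   return int(proc.stdout)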

TLDR; Question 3: Is there a way to temporarily disable the python hash salting aside from starting a new python process?

One last piece of context: it's okay if a cache directory for one node/OS/Python-version isn't compatible with that of another. But if I have a local cache on my local machine that is being updated within the same context, it should all work correctly. (It would be even better if the cache were consistent across all these things, but that isn't required of a solution.)

Any other solutions that fall within the constraints I've laid out would also be very welcome! Thanks in advance.

Best Answer

Change your on-disk cache so that it maps parameters to the result of the calculation, rather than mapping a hash code to the result.

Since the calculation takes very long, having to match the parameters shouldn't be a big deal, even though it is somewhat inefficient. And of course you _can_ still calculate hash codes for the parameters that you compare in memory.
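
A minimal sketch of that lookup, assuming the pickled params round-trip to an equal object (file names here are just placeholders):

import os, pickle

def _find_or_claim_slot(params, cache_dir):
   # Walk numbered key files and compare the loaded params by equality,
   # without relying on hash() at all.
   k = 0
   while True:
      key_path = os.path.join(cache_dir, 'key_%d.pkl' % k)
      if not os.path.isfile(key_path):
         with open(key_path, 'wb') as f:
            pickle.dump(params, f)   # claim this slot
         break
      with open(key_path, 'rb') as f:
         if pickle.load(f) == params:
            break
      k += 1
   return os.path.join(cache_dir, 'val_%d.pkl' % k)

The in-memory dictionary can keep using hash(params) exactly as before, since it only needs to be consistent within a single process.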