Garbage Collection Memory Usage – Garbage Collection & Memory Leaks on Hash Tables

garbage-collectionhashingmemory usage

I was reading R. Read's How to be a programmer, and I came accross something I didn't understand:

…even with garbage collection, you can fill up all memory with
garbage. A classic mistake is to use a hash table as a cache and
forget to remove the references in the hash table. Since the reference
remains, the referent is noncollectable but useless. This is called a
memory leak. You should look for and fix memory leaks early. If you
have long running systems memory may never be exhausted in testing but
will be exhausted by the user.

So let's say I have a dictionary structure in python, indexed on md5 hashes (is this the kind of hashtable he's referring to?). Eg:

x = {}
x['c4ca4238a0b923820dcc509a6f75849b'] = 1
x['c81e728d9d4c2f636f067f89cc14862c'] = 2

Can someone now walk me through his example? What do I have to do now concretely to cause a memory leak?

Best Answer

It's pretty basic. Walk through the code:

x = {}

Memory reserved for overhead

x['c4ca4238a0b923820dcc509a6f75849b'] = 1

Memory for one key/value pair allocated

x['c81e728d9d4c2f636f067f89cc14862c'] = 2

Memory for two key/value pairs allocated

Now imagine we do this 10,000 times. We will have allocated space for 10,000 key/value pairs. Each hash added increases memory usage. If these values will indeed never be used again, they are "useless", but since you've told python to save them, they will not be collected. You'd need to remove the reference like this:

del x['c4ca4238a0b923820dcc509a6f75849b']

You have to do this because in general, the garbage collector can't know that you are never going to look this value up.

The case talked about here is using a dictionary as a cache, and presumably you don't really know if you will need a particular value. Also, presumably, there's no real limit to the number of values. So with a long-running app, you could conceivably exhaust memory. You'd need to have some scheme to delete elements from the cache when they are no longer needed, presumably by testing recency of use.

GC can't help you here any more than it can help you if you attempt to allocate a 100 gigabyte array. You've told it explicitly through referencing not to discard any of the data.