Data Structures – Optimal Way of Storing Sorted Tree in Memory

data structurestrees

Wondering what the general structure is for storing a tree in memory most efficiently (for access operation, and space).

For the access operation, it should take advantage of contiguous memory (memory locality) somehow which I am having a hard time determining how that should work for a tree. This way it can take advantage of the cache better during traversal or random subtree access. Also assume only concerned with in-memory structure, not going onto disk or anything. I am specifically thinking of B+-trees, but any BST would probably work I am not sure.

Specifically wondering how to organize the pointers for efficiency.

For example, say you have this JS tree:

var tree = {
  a: {
    b: {
      c: {
        d: {
          e: 100,
          f: 200,
          g: 300
        },
        h: {
          i: 400,
          j: 500,
          k: 600
        }
      }
    }
  }
}

If you imagine that this is actually implemented as a hash table (the keys are hashed to get the value), then what is the optimal layout in memory for the tree. Each of the layers would be it's own hash table, which might have an underlying array as its structure. This means each level's nodes are in the same region as each other, but probably not next to each other.

If instead of a hash table, it was implemented as some sort of linked list, each level could have the children next to each other, but the lists themselves might be in arbitrary places in memory still. The question is, can the lists be organized so they are also close in memory somehow to make things more efficient. Or perhaps the lists have a standard structure/size range, so we can just jump to a spot in memory for faster lookups. That sort of stuff.

Update

Given roughly the limitations of the need for a few million pieces of data and ~500MB of space.

Best Answer

The optimal layout will almost certainly depend on application. That said, anything that fits as much as possible that will be accessed together into a cache line (typically 64 bytes in current processors) will do the job. For a search tree, assuming you have reasonably small keys, storing a sorted list of as many keys and pointers as you can fit into the cache line is probably the best you can achieve.

Which is to say: don't arbitrarily limit yourself to a binary tree, use the available space in a cache line to minimize the number of cache lines you need to hit. Basically, the same considerations that lead to using B+ trees for on-disk structures, but with a smaller target size.

For string keys, remove the common prefix of all keys in the node, so you can fit more items in. You may need to store variable length keys, so being able to work with a variable number of items in each node is important.