R – How to determine which kind of tree data structure to choose

data structurestree

Ok, so this is something that's always bothered me. The tree data structures I know of are:

Unbalanced binary trees
AVL trees
Red-black trees
2-3 trees
B-trees
B*-trees
Tries
Heaps

How do I determine what kind of tree is the best tool for the job? Obviously heaps are canonically used to form priority queues. But the rest of them just seem to be different ways of doing the same thing. Is there any way to choose the best one for the job?

Best Answer

Let’s pick them off one by one, shall we?

Unbalanced binary trees

For search tasks, never. Basically, their performance characteristics will be completely unpredictable and the overhead of balancing a tree won’t be so big as to make unbalanced trees a viable alternative.

Apart from that, unbalanced binary trees of course have other uses, but not as search trees.

AVL trees

They are easy to develop but their performance is generally surpassed by other balancing strategies because balancing them is comparatively time-intensive. Wikipedia claims that they perform better in lookup-intensive scenarios because their height is slightly less in the worst case.

Red-black trees

These are used inside most of C++’ std::map implemenations and probably in a few other standard libraries as well. However, there’s good evidence that they are actually worse than B(+) trees in every scenario due to caching behaviour of modern CPUs. Historically, when caching wasn’t as important (or as good), they surpassed B trees when used in main memory.

2-3 trees

B-trees

B*-trees

These require the most careful consideration of all the trees, since the different constants used are basically “magical” constans which relate in weird and sometimes unpredictable way to the underlying hardware architecture. For example, the optimal number of child nodes per level can depend on the size of a memory page or cache line.

I know of no good, general rule to distinguish between them.

Tries

Completely different. Tries are also search trees, but for text retrieval of substrings in a corpus. A trie is an uncompressed prefix tree (i.e. a tree in which the paths from root to leaf nodes correspond to all the prefixes of a given string).

Tries should be compared to, and offset against, suffix trees, suffix arrays and q-gram indices – not so much against other search trees because the data that they search is different: instead of discrete words in a corpus, the latter index structures allow a factor search.

Heaps

As you’ve already said, they are not search trees at all.

Related Solutions

C# – Tree data structure in C#

My best advice would be that there is no standard tree data structure because there are so many ways you could implement it that it would be impossible to cover all bases with one solution. The more specific a solution, the less likely it is applicable to any given problem. I even get annoyed with LinkedList - what if I want a circular linked list?

The basic structure you'll need to implement will be a collection of nodes, and here are some options to get you started. Let's assume that the class Node is the base class of the entire solution.

If you need to only navigate down the tree, then a Node class needs a List of children.

If you need to navigate up the tree, then the Node class needs a link to its parent node.

Build an AddChild method that takes care of all the minutia of these two points and any other business logic that must be implemented (child limits, sorting the children, etc.)

Java – How to determine if binary tree is balanced

Stumbled across this old question while searching for something else. I notice that you never did get a complete answer.

The way to solve this problem is to start by writing a specification for the function you are trying to write.

Specification: A well-formed binary tree is said to be "height-balanced" if (1) it is empty, or (2) its left and right children are height-balanced and the height of the left tree is within 1 of the height of the right tree.

Now that you have the specification, the code is trivial to write. Just follow the specification:

IsHeightBalanced(tree)
    return (tree is empty) or 
           (IsHeightBalanced(tree.left) and
            IsHeightBalanced(tree.right) and
            abs(Height(tree.left) - Height(tree.right)) <= 1)

Translating that into the programming language of your choice should be trivial.

Bonus exercise: this naive code sketch traverses the tree far too many times when computing the heights. Can you make it more efficient?

Super bonus exercise: suppose the tree is massively unbalanced. Like, a million nodes deep on one side and three deep on the other. Is there a scenario in which this algorithm blows the stack? Can you fix the implementation so that it never blows the stack, even when given a massively unbalanced tree?

UPDATE: Donal Fellows points out in his answer that there are different definitions of 'balanced' that one could choose. For example, one could take a stricter definition of "height balanced", and require that the path length to the nearest empty child is within one of the path to the farthest empty child. My definition is less strict than that, and therefore admits more trees.

One can also be less strict than my definition; one could say that a balanced tree is one in which the maximum path length to an empty tree on each branch differs by no more than two, or three, or some other constant. Or that the maximum path length is some fraction of the minimum path length, like a half or a quarter.

It really doesn't matter usually. The point of any tree-balancing algorithm is to ensure that you do not wind up in the situation where you have a million nodes on one side and three on the other. Donal's definition is fine in theory, but in practice it is a pain coming up with a tree-balancing algorithm that meets that level of strictness. The performance savings usually does not justify the implementation cost. You spend a lot of time doing unnecessary tree rearrangements in order to attain a level of balance that in practice makes little difference. Who cares if sometimes it takes forty branches to get to the farthest leaf in a million-node imperfectly-balanced tree when it could in theory take only twenty in a perfectly balanced tree? The point is that it doesn't ever take a million. Getting from a worst case of a million down to a worst case of forty is usually good enough; you don't have to go all the way to the optimal case.

Best Answer

Related Solutions

C# – Tree data structure in C#

Java – How to determine if binary tree is balanced

Related Topic