Binary Tree – How to Represent a Directory Structure

binary treecdata structures

I was just thinking about implementing a binary tree with its root pointing to the parent folder and then the children pointing to subdirs nested within to store a directory structure in a binary tree.

I am not sure how to cater this problem with binary trees or if its possible?
M-way trees seems like the correct way of approaching this problem.

However, using linked list instead of a single node for binary trees and then dividing subdirs/2 at every level and storing them respectively in the left and right nodes(linked list), ensuring every node points to exactly two linked lists (left & right) which further store the subdirs.

Still confused on how to approach this

any suggestions?

Best Answer

Sufficient effort can make most things work, but I personally probably wouldn't use a plain binary tree for this in the future, and never have in the past. Here's an outline of the options as I see them:

  1. Simplest binary tree approach: each directory has no more than one direct parent directory, and no more than two child directories. I don't think this is what you want, but maybe?

    It's worth noting that e.g. Unix-style directories can have multiple parents, as both symbolic links (text files, as I recall), and hard links (which are implemented by the actual filesystem itself) allow multiple directories to all include the same directory as one of their children (even if that child is one of those parent directories, as I recall); thus the simplest binary tree system couldn't properly represent Unix directories.

    Similarly, you probably want to support more than two child directories.

    I recommend against the simplest binary implementation.
  2. "Decorated" binary tree: much as a red-black tree carries a little extra data to differentiate different types of nodes (red nodes vs black nodes), so too could your binary directory tree. For example, you could have a node type that uses the root pointer for it's child, while it's child pointers are it's parents. If you found two children or two parents too restrictive, then you could have a flag that indicated a particular node was internal, and that getting to the "actual" pointers required more traversal.

    I don't really suggest this because I consider it to mix different aspects of the subject too much, but it's worth considering.
  3. Nodes with more than 2 children / 1 parent: this is effectively just an optimization of option (2). I do feel that it's worth pointing out that filesystems and other types of databases do traditionally use M-way trees, particularly B-trees, T-trees, etc., because both latency and generally slower speeds of bulk and persistent data stores usually make them more efficient than simple binary trees, even if algorithms are much more complicated.

    Regardless, unless you're going for option (1), M-way trees are an optimization instead of a completely distinct method. I suggest them for later optimizations.
  4. Simple tries: tries, like M-way trees, are mostly an optimization of option (2). Tries are used for keys that are either arbitrarily long, or just longer than the size you've chosen for your tree nodes. They cut an individual key into parts, such that an individual node's actual key consists of the concatenation of all of the keys from the root to itself. When handling strings, tries allow much more memory-efficient trees, since partially identical strings can share some or all of their matching portions. "Tree" and "Trie" would both share their first two letters, while "there" and "their" would share the first three letters, for example. Words such as "apple" and "banana" point to a complication of the system: if no areas of the keys can be shared, or the number of children splitting off of a node grows to be greater than the number of child pointers ("tree", "take", and "there" would encounter this issue with binary tree styled nodes), then the trie variant needs a way to support this.

    I recommend against simple implementations of tries.
  5. "Decorated" tries: essentially (2) & (4) combined, possibly with (3) mixed in for efficiency.

    If you want to use only a single format of node, then this is probably your best bet. I conditionally (you want a single node format) recommend in favor of "decorated" tries.
  6. Internally complex tries: essentially use (4) to define the interface, and then use multiple node types internally for your implementation. If using just binary trees, then the implementation would consist of an arbitrarily deep nesting of trees, with each individual tree used to differentiate between the first character of a sub-key, and both the rest of the sub-key, as well as any value/pointer that corresponded to that position in the trie, being associated with the root pointer to the child tree.

    I strongly recommend internally complex tries. (2), (3), and hash-tables can be mixed in if optimization dictates, and in the meantime a fully-capable system can be created with a relatively minor amount of complication in comparison to a simple binary tree. If multiply-parented directories are desired, then the directories can each be separated into a distinct trie, with an additional system to indicate parent directories, whether a "reversal" flag as in (2), a "search stack" system that completely removes parentage tracking from the system, and instead tracks how a particular search got to the current directories, or even some mixture of techniques.

Personally, I've always gone with (6). I consider it's separation-of-concerns to be particularly desirable. If you have a particular requirement that forces you into something else, I would advise running option (6) by the specifier, just to see if they consider it compliant with the part of the requirement that you're concerned about.