Design – How do we keep dependent data structures up to date

data-structures · design · design-patterns · language-agnostic · object-oriented-design

Suppose you have a parse tree, an abstract syntax tree, and a control flow graph, each one logically derived from the one before. In principle it is easy to construct each graph given the parse tree, but how can we manage the complexity of updating the graphs when the parse tree is modified? We know exactly how the tree has been modified, but how can the change be propagated to the other trees in a way that doesn't become difficult to manage?

Naturally the dependent graph can be updated by simply reconstructing it from scratch every time the first graph changes, but then there would be no way of knowing the details of the changes in the dependent graph.

I currently have four ways to attempt to solve this problem, but each one has difficulties.

  1. Nodes of the dependent tree each observe the relevant nodes of the original tree, updating themselves and the observer lists of original tree nodes as necessary. The conceptual complexity of this can become daunting.
  2. Each node of the original tree has a list of the dependent tree nodes that specifically depend upon it. When a node changes, it sets a flag on those dependent nodes to mark them as dirty, and the flag propagates through the parents of the dependent nodes all the way up to the root. After each change we run an algorithm much like the one for constructing the dependent graph from scratch, except that it skips over any clean node and reconstructs each dirty node, keeping track of whether the reconstructed node actually differs from the dirty one. This can also get tricky.
  3. We can represent the logical connection between the original graph and the dependent graph as a data structure, like a list of constraints, perhaps designed using a declarative language. When the original graph changes we need only scan the list to discover which constraints are violated and how the dependent tree needs to change to correct the violation, all encoded as data.
  4. We can reconstruct the dependent graph from scratch as though there were no existing dependent graph, and then compare the old graph with the new one to discover what has changed. I'm sure this is the easiest way because I know there are algorithms available for detecting differences between trees, but they are all quite computationally expensive, and in principle the work seems unnecessary, so I'm deliberately avoiding this option.
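To make option 2 concrete, here is a rough sketch of the part that tends to get tricky: propagating the dirty flag up through the ancestors of a dependent node. All class and method names are invented for illustration.

```python
# Option 2 sketch: a change in the original tree marks its dependents
# dirty and propagates the flag up to the root, so a later sweep can
# skip clean subtrees entirely.

class DependentNode:
    def __init__(self, parent=None):
        self.parent = parent
        self.dirty = False

    def mark_dirty(self):
        node = self
        # Stop early once an ancestor is already dirty: everything above
        # it has been marked by a previous change.
        while node is not None and not node.dirty:
            node.dirty = True
            node = node.parent

class OriginalNode:
    def __init__(self):
        self.dependents = []  # dependent-tree nodes derived from this one

    def changed(self):
        for dep in self.dependents:
            dep.mark_dirty()

root = DependentNode()
child = DependentNode(parent=root)
leaf = DependentNode(parent=child)

orig = OriginalNode()
orig.dependents.append(leaf)
orig.changed()
assert leaf.dirty and child.dirty and root.dirty
```

The reconstruction sweep would then walk the dependent tree, descending only into dirty nodes.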

What is the right way to deal with these sorts of problems? Surely there must be a design pattern that makes this whole thing almost easy. It would be nice to have a good solution for every problem of this general description. Does this class of problem have a name?


Let me elaborate on the troubles that this problem causes. This issue pops up in various places, whenever two parts of a project operate on graphs, with each graph being a different representation of the same thing that changes while the software is running. It's like making an adapter for an interface, but instead of wrapping a single object or a fixed number of objects, we need to wrap an entire graph of arbitrary size.

Every time I try this I end up with a confusing, unmaintainable mess. The control flow of observers can be difficult to follow once it gets complicated, and the algorithm for converting one graph to another is usually tricky enough to follow even when it's laid out plainly rather than spread across multiple classes. The problem is that there seems to be no way to use a plain, straightforward graph-conversion algorithm when the original graph is changing.

Naturally we can't just use an ordinary graph conversion algorithm directly because that can't respond to changes in any way other than starting from scratch, so what are the alternatives? Perhaps the algorithm could be written in a continuation-passing style where each step of the algorithm is represented as an object with a method for each type of node in the original graph, like a visitor. Then the algorithm can be assembled by composing various simple visitors together.
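The composition idea might be sketched as follows: small visitors, one per node kind, assembled into a single conversion function. The node types and visitor names here are invented purely for illustration.

```python
# Sketch of composing small per-node-kind visitors into one algorithm.
# Each visitor handles one node type and delegates recursion back to the
# composed function it receives.

class Num:
    def __init__(self, n): self.n = n

class Add:
    def __init__(self, left, right): self.left, self.right = left, right

class NumVisitor:
    def visit(self, node, recurse):
        if isinstance(node, Num):
            return node.n
        return None  # not my node kind; let another visitor try

class AddVisitor:
    def visit(self, node, recurse):
        if isinstance(node, Add):
            return recurse(node.left) + recurse(node.right)
        return None

def compose(*visitors):
    def recurse(node):
        for v in visitors:
            result = v.visit(node, recurse)
            if result is not None:
                return result
        raise TypeError(f"no visitor for {type(node).__name__}")
    return recurse

evaluate = compose(NumVisitor(), AddVisitor())
assert evaluate(Add(Num(1), Add(Num(2), Num(3)))) == 6
```

In an incremental setting, each visitor would additionally receive the change notification instead of a whole-tree traversal, but the composition mechanism stays the same.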


Another example: Suppose you have a GUI laid out the way you might in Java Swing, using JPanels and layout managers. You can simplify that process by using nested JPanels in place of complex layout managers, so you end up with a tree of containers that includes nodes which exist only for layout purposes and are otherwise meaningless. Now suppose the same tree that generates your GUI is also used in another part of your application, which, instead of laying the tree out graphically, feeds it to a library that generates an abstract representation of the tree as a system of folders. To use this library, we need a version of the tree without the layout nodes: the layout nodes must be flattened into their parent nodes, yet the library still needs to be notified each time the tree changes, as though the two versions of the tree were a single data structure.
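The flattening step in this example might look roughly like the following hypothetical sketch; `Node` and `layout_only` are invented names standing in for the real container types.

```python
# Flatten a GUI tree by splicing out layout-only containers and hoisting
# their children into the parent, yielding the tree the folder-generating
# library would expect.

class Node:
    def __init__(self, name, children=(), layout_only=False):
        self.name = name
        self.children = list(children)
        self.layout_only = layout_only

def flatten(node):
    """Return a copy of the tree with layout-only nodes spliced out."""
    flat_children = []
    for child in node.children:
        flat = flatten(child)
        if flat.layout_only:
            flat_children.extend(flat.children)  # hoist grandchildren
        else:
            flat_children.append(flat)
    return Node(node.name, flat_children, node.layout_only)

gui = Node("root", [
    Node("row", [Node("a"), Node("b")], layout_only=True),
    Node("c"),
])
folders = flatten(gui)
assert [c.name for c in folders.children] == ["a", "b", "c"]
```

The hard part the question describes is, of course, not this one-shot conversion but keeping `folders` consistent as `gui` mutates.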


Another way to look at it: The very concept of working with mutable trees violates the Law of Demeter. It wouldn't really be a violation of the law if the tree were a value as parse trees and syntax trees normally are, but in that case there would be no problem since nothing would need to be kept up-to-date. So then this problem exists as a direct result of violating the Law of Demeter, but how do you avoid that in general when your domain seems to be about manipulating trees or graphs?

The Composite pattern is a wonderful tool for turning a graph into a single object and obeying the Law of Demeter. Is it possible to use the Composite pattern to effectively turn one kind of tree into another? Can you make a Composite parse tree so that it acts like an abstract syntax tree and even a control flow graph? Is there a way to do it without violating the single responsibility principle? The Composite pattern tends to cause classes to absorb every responsibility that they touch, but perhaps it could be combined with the Strategy pattern somehow.
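One speculative way to combine the two patterns: the composite handles structure and traversal, while pluggable strategies decide how each node appears in a given derived view, so one tree can present itself as several. Everything below is an invented illustration, not an established reference implementation.

```python
# Composite + Strategy sketch: the Composite owns the tree shape; each
# ViewStrategy turns that shape into a different derived representation.

class ViewStrategy:
    def render(self, node, children_rendered):
        raise NotImplementedError

class OutlineView(ViewStrategy):
    """Render the tree as an indented outline of labels."""
    def render(self, node, children_rendered):
        lines = [node.label]
        for child_lines in children_rendered:
            lines.extend("  " + line for line in child_lines)
        return lines

class CountView(ViewStrategy):
    """Interpret the same tree as a node count."""
    def render(self, node, children_rendered):
        return 1 + sum(children_rendered)

class Composite:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def accept(self, strategy):
        rendered = [c.accept(strategy) for c in self.children]
        return strategy.render(self, rendered)

tree = Composite("root", [Composite("x"), Composite("y", [Composite("z")])])
assert tree.accept(CountView()) == 4
assert tree.accept(OutlineView())[0] == "root"
```

This keeps each representation's logic in its own strategy class, which is one way to avoid the composite absorbing every responsibility it touches.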

Best Answer

I think your scenarios are discussing variations on the Observer Pattern. Each original node (“subject”) has (at least) the following two methods:

  • registerObserver(observer) – adds a dependent node to the list of observers.
  • notifyObservers() – calls x.notify(this) on each observer x.

And each dependent node (“observer”) has a notify(original) method. Comparing your scenarios:

  1. The notify method immediately rebuilds a dependent subtree.
  2. The notify method sets a flag, the recomputation happens after each set of updates.
  3. The notifyObservers method is smart and only notifies those observers whose constraints are invalidated. This would probably use the Visitor Pattern, so that the dependent nodes can offer a method that decides this.
  4. (brute-force rebuilding; this one is not a variation of the Observer pattern at all)
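A minimal skeleton of that subject/observer split, with a flag-setting notify corresponding to scenario 2, might look like this (a sketch, not a complete implementation):

```python
# Subject/observer skeleton: subjects (original nodes) keep an observer
# list; observers (dependent nodes) receive notify() and here merely mark
# themselves dirty for a later batch recomputation.

class Subject:
    def __init__(self):
        self._observers = []

    def register_observer(self, observer):
        self._observers.append(observer)

    def notify_observers(self):
        for obs in self._observers:
            obs.notify(self)

class DirtyFlagObserver:
    def __init__(self):
        self.dirty = False

    def notify(self, subject):
        self.dirty = True  # defer the actual recomputation

subject = Subject()
observer = DirtyFlagObserver()
subject.register_observer(observer)
subject.notify_observers()
assert observer.dirty
```

Scenario 1 would rebuild inside notify() instead of setting the flag; scenario 3 would filter the observer list before notifying.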

As the first three ideas are just variations on the Observer pattern, their designs will have similar complexity (as it happens, they are ordered by increasing complexity – I'd say №1 is the simplest to implement).

I can think of one enhancement: building the dependent trees lazily. Each dependent node would then have a boolean flag marking it as valid or invalid. Each accessor method would check this flag and, if necessary, recalculate the subtree. The difference from №2 is that recalculation happens on access, not upon change. This would probably result in the fewest computations, but it can lead to significant difficulties if the type of a node would have to change upon access.
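A small sketch of that lazy variant: invalidation only flips the flag, and the accessor recomputes on demand. The doubling "derivation" below stands in for a real subtree rebuild.

```python
# Lazy derived node: cheap invalidation, recomputation deferred until
# the value is actually read. Multiple invalidations between reads
# collapse into a single recomputation.

class LazyDerived:
    def __init__(self, source_value):
        self._source = source_value
        self._cache = None
        self._valid = False

    def invalidate(self, new_source_value):
        self._source = new_source_value
        self._valid = False  # cheap: no recomputation here

    @property
    def value(self):
        if not self._valid:
            self._cache = self._source * 2  # recompute on access
            self._valid = True
        return self._cache

node = LazyDerived(3)
assert node.value == 6
node.invalidate(7)
node.invalidate(8)   # coalesces with the previous invalidation
assert node.value == 16
```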


I would also like to challenge the need for multiple dependent trees. For example, I always structure my parsers in a way that they immediately emit an AST. Information that is only relevant during construction of this tree doesn't have to be stored in any permanent data structure. Likewise, you can also choose your objects in such a way that the AST has an interpretation as a control flow graph.
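As a toy illustration of a parser that emits AST nodes directly, with no intermediate parse tree: the grammar here is just `+`-separated integers, purely for demonstration, and a real parser would tokenize properly.

```python
# Recursive "fold-as-you-parse" miniature: AST nodes are created directly
# during parsing; transient parse state (string fragments) is discarded
# rather than stored in a permanent parse tree.

class Num:
    def __init__(self, n): self.n = n

class Add:
    def __init__(self, left, right): self.left, self.right = left, right

def parse_sum(text):
    parts = text.split("+")
    node = Num(int(parts[0]))
    for part in parts[1:]:
        node = Add(node, Num(int(part)))  # fold directly into the AST
    return node

def evaluate(node):
    if isinstance(node, Num):
        return node.n
    return evaluate(node.left) + evaluate(node.right)

assert evaluate(parse_sum("1+2+3")) == 6
```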

For a real-life example, the compiler inside the Perl interpreter does this: the AST is built bottom-up, during which some nodes are constant-folded away. In a second pass, the nodes are connected in execution order, during which some nodes are skipped by optimizations. The result is very fast parsing (and few allocations) but very limited optimization. It should be noted that while such a design is possible, it is probably not something you should strive for: it is a calculated trade-off and a complete violation of the Single-Responsibility Principle.

If you do actually need multiple trees, then you should also consider whether they really have to be built simultaneously. In the majority of cases, a parse tree is constant after the parse. Likewise, an AST will probably stay constant after macros are resolved and AST-level optimizations have been executed.