AST Representation – Homogeneous vs. Heterogeneous AST Representation

compiler, data structures, trees

What are the reasons to choose a homogeneous vs. a heterogeneous AST representation for implementing a complex domain-specific programming language?

Just to be very clear about what I'm asking, here is some extra background:

By homogeneous, I mean a tree built from nodes of a single generic type. I think this question is really language-independent, but using a C++-like struct for illustration, I'd consider this a minimal homogeneous abstract syntax tree node:

struct Node {
  int tag;
  void *data;

  Node *first_child;
  Node *next_sibling;
};

By heterogeneous, I mean a tree built from nodes of multiple distinct types (e.g. one for each grammar production). Again, I don't want to assume a particular language, but using C++-like structs for illustration, I'd consider these types part of a hierarchy used to build a heterogeneous abstract syntax tree:

struct Node {};

struct Integer_Node : Node {
  int value;
};

struct Plus_Node : Node {
  Node *right;
  Node *left;
};

struct If_Statement : Node {
  Node *Condition;
  Node *Then_Expression;
  Node *Else_Expression;  
};

// ... more types, depending on the language ...

Over the years, I've implemented several small, special-purpose compilers, usually in a very ad-hoc way. I've never used much of a real "AST" because syntax-directed translation has usually been good enough.

Now I'm in the process of designing and implementing a new, much more complex language, where I will be building an AST and then walking over it in multiple passes for verification, semantic analysis, and so forth.

It seems that a homogeneous scheme reduces the amount of code up front, but I wonder if a heterogeneous scheme will pay off better in the long run for reasons I'm not considering. On the other hand, the heterogeneous scheme seems to allow benefiting from the compiler's static type checking, virtual method dispatch, etc., but I wonder if any of that is really very useful when developing semantic passes and so forth.

Basically, I'm hoping to gain some insight from those who may have some real experience here. I've read many compiler books and have a moderate amount of basic compiler-writing experience, but I haven't seen this particular dichotomy addressed in any literature I can get my hands on.

Best Answer

For me, the big advantage of the heterogeneous AST is that it forms a kind of forced, annotated switch statement (assuming a C-like language).

For the homogeneous AST you usually end up with some kind of routine or class with a big switch statement. You need to keep track of which child node is what yourself. "First child is the conditional, second the true-block, third the false-block." Whenever you change the code, you easily find yourself making a mental picture of your DSL syntax over and over again.
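To make the bookkeeping concrete, here is a minimal sketch of such a switch routine over the homogeneous Node from the question. The tag names, the nth_child helper, and the convention that an integer node's data points at an int are all assumptions made up for this illustration.

```cpp
#include <cassert>

// Hypothetical tags; the question only specifies `int tag`.
enum Tag { TAG_INTEGER, TAG_PLUS, TAG_IF };

// Same layout as the homogeneous Node in the question.
struct Node {
  int tag;
  void *data;

  Node *first_child;
  Node *next_sibling;
};

// Returns the n-th child (0-based), or nullptr if there are fewer children.
Node *nth_child(Node *node, int n) {
  Node *child = node->first_child;
  while (child != nullptr && n-- > 0) child = child->next_sibling;
  return child;
}

// Evaluates an expression tree. Which child means what is pure convention:
// nothing in the types stops you from mixing up the positions.
int eval(Node *node) {
  switch (node->tag) {
  case TAG_INTEGER:
    // Assumption: data points at an int for integer nodes.
    return *static_cast<int *>(node->data);
  case TAG_PLUS:
    return eval(nth_child(node, 0)) + eval(nth_child(node, 1));
  case TAG_IF:
    // Child 0 is the condition, child 1 the true-block, child 2 the false-block.
    return eval(nth_child(node, 0)) != 0 ? eval(nth_child(node, 1))
                                         : eval(nth_child(node, 2));
  }
  assert(false && "unknown tag");
  return 0;
}
```

The comment in the TAG_IF case is exactly the mental picture you have to keep re-creating, and every other pass over the tree repeats it.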

Of course you can document heavily, but a good program ought to be self-documenting as much as possible. The heterogeneous AST does just that.

Furthermore, you can easily turn a heterogeneous AST into a homogeneous one, but not the other way around. To do so, add the tag info (which is a good idea anyway, unless your language supports a cheap is-a query) and a Node(int index) method that returns the named fields by position. So you lose nothing in generality by using the heterogeneous AST.
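As a rough sketch of that idea (the Kind enum and the kind, child_count and child names are invented for this example, and child plays the role of the Node(int index) method), the typed nodes keep their named fields while also exposing a generic, index-based view:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tag type standing in for the "tag info".
enum class Kind { Integer, Plus };

struct Node {
  virtual ~Node() = default;
  virtual Kind kind() const = 0;
  // Homogeneous view: children by position. Leaves have none.
  virtual std::size_t child_count() const { return 0; }
  virtual Node *child(std::size_t) const { return nullptr; }
};

struct Integer_Node : Node {
  int value;
  explicit Integer_Node(int v) : value(v) {}
  Kind kind() const override { return Kind::Integer; }
};

struct Plus_Node : Node {
  Node *left;
  Node *right;
  Plus_Node(Node *l, Node *r) : left(l), right(r) {}
  Kind kind() const override { return Kind::Plus; }
  std::size_t child_count() const override { return 2; }
  // Maps positions onto the named fields.
  Node *child(std::size_t index) const override {
    return index == 0 ? left : index == 1 ? right : nullptr;
  }
};

// A generic pass written against the homogeneous view only.
std::size_t count_nodes(const Node *node) {
  assert(node != nullptr);
  std::size_t total = 1;
  for (std::size_t i = 0; i < node->child_count(); ++i)
    total += count_nodes(node->child(i));
  return total;
}
```

Passes that care about meaning can use left and right directly; passes that only need to walk the tree, such as count_nodes, never have to know the concrete node types.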

I won't mention that the heterogeneous AST is ideal for the Visitor pattern, as it is just as easy to use the Strategy pattern with the homogeneous switch routine. It is easier to add specific functionality to the heterogeneous AST itself, though. If you want to turn it into an interpreter, all you need to do is add some kind of "eval" methods.
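For instance, a toy interpreter might look roughly like the sketch below. It reuses the node names from the question; the eval signature, the int result type, the use of std::unique_ptr for ownership, and the rule that a non-zero condition counts as true are all assumptions of this example.

```cpp
#include <cassert>
#include <memory>
#include <utility>

struct Node {
  virtual ~Node() = default;
  // The added "eval" method; every node type must implement it.
  virtual int eval() const = 0;
};

struct Integer_Node : Node {
  int value;
  explicit Integer_Node(int v) : value(v) {}
  int eval() const override { return value; }
};

struct Plus_Node : Node {
  std::unique_ptr<Node> left;
  std::unique_ptr<Node> right;
  Plus_Node(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
      : left(std::move(l)), right(std::move(r)) {
    assert(left && right);
  }
  int eval() const override { return left->eval() + right->eval(); }
};

struct If_Statement : Node {
  std::unique_ptr<Node> Condition;
  std::unique_ptr<Node> Then_Expression;
  std::unique_ptr<Node> Else_Expression;
  If_Statement(std::unique_ptr<Node> c, std::unique_ptr<Node> t,
               std::unique_ptr<Node> e)
      : Condition(std::move(c)), Then_Expression(std::move(t)),
        Else_Expression(std::move(e)) {
    assert(Condition && Then_Expression && Else_Expression);
  }
  // Assumption: any non-zero condition value selects the then-branch.
  int eval() const override {
    return Condition->eval() != 0 ? Then_Expression->eval()
                                  : Else_Expression->eval();
  }
};
```

Because eval is pure virtual, forgetting to handle a new node type is a compile error rather than a missing case in a switch.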

I would consider a homogeneous AST under limiting circumstances: if you need to port the compiler to a system with no OOP language available, or if you need to optimize for speed. The homogeneous AST is also easier to combine with an FSM, which can be an advantage if you want a general multi-purpose compiler that loads syntax rules on the fly. Even then, it is easier to start out with a heterogeneous AST and have it generate those tables after the compiler has been thoroughly tested.

So, all in all, I would say that neither tree offers specific advantages in terms of "does this tree help or hinder in, say, 'semantic passes'?" The advantage of the heterogeneous AST is, in my experience, to reduce the amount of thought and concentration you have to put into coding the tedious stuff of the compiler. There is a lot of repetitiveness and bookkeeping going on, so let the computer do the work for you as much as possible, is my motto.

Related Solutions

Visitor Pattern – Implementing for an Abstract Syntax Tree

It is up to the visitor implementation to decide whether to visit child nodes and in which order. That's the whole point of the visitor pattern.

In order to adapt the visitor for more situations it is helpful (and quite common) to use generics like this (it's Java):

public interface ExpressionNodeVisitor<R, P> {
    R visitNumber(NumberNode number, P p);
    R visitBinary(BinaryNode expression, P p);
    // ...
}

And an accept method would look like this:

public interface ExpressionNode extends Node {
    <R, P> R accept(ExpressionNodeVisitor<R, P> visitor, P p);
    // ...
}

This makes it possible to pass additional parameters to the visitor and to retrieve a result from it. Expression evaluation can then be implemented like this:

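public class EvaluatingVisitor
    implements ExpressionNodeVisitor<Double, Void> {
    @Override
    public Double visitNumber(NumberNode number, Void p) {
        // Parse the number and return it.
        return Double.valueOf(number.getText());
    }

    @Override
    public Double visitBinary(BinaryNode binary, Void p) {
        switch (binary.getOperator()) {
        case '+':
            // Evaluate both operands recursively, then add the results.
            return binary.getLeftOperand().accept(this, p)
                + binary.getRightOperand().accept(this, p);
        // More cases for other operators here.
        default:
            // Every path must return or throw, otherwise this does not compile.
            throw new UnsupportedOperationException(
                "Unknown operator: " + binary.getOperator());
        }
    }
}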

The accept method parameter isn't used in the above example, but just believe me: it is quite useful to have one. For example, it can be a Logger instance to report errors to.

Design – How do we keep dependent data structures up to date

I think your scenarios are discussing variations on the Observer Pattern. Each original node (“subject”) has (at least) the following two methods:

registerObserver(observer) – adds a dependent node to the list of observers.
notifyObservers() – calls x.notify(this) on each observer

And each dependent node (“observer”) has a notify(original) method. Comparing your scenarios:

1. The notify method immediately rebuilds a dependent subtree.
2. The notify method sets a flag; the recomputation happens after each set of updates.
3. The notifyObservers method is smart and only notifies those observers whose constraints are invalidated. This would probably use the Visitor Pattern, so that the dependent nodes can offer a method that decides this.
4. (This pattern has no relation to brute-force rebuilding.)

As the first three ideas are just variations on the observer pattern, their design will have similar complexity (as it happens, they are ordered by increasing complexity; I'd think №1 is the simplest to implement).
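A minimal sketch of the first two variations, with registerObserver, notifyObservers and notify taken from the description above. Everything else is an assumption for illustration: the int payload, the "doubled" value standing in for a dependent subtree, the rebuild counters, and the flush method that a caller would invoke after a set of updates.

```cpp
#include <cassert>
#include <vector>

struct Subject;

struct Observer {
  virtual ~Observer() = default;
  virtual void notify(Subject &original) = 0;
};

struct Subject {
  int value = 0;
  std::vector<Observer *> observers;

  void registerObserver(Observer *observer) { observers.push_back(observer); }
  void notifyObservers() {
    for (Observer *x : observers) x->notify(*this);
  }
  // Hypothetical mutator: every change notifies all observers.
  void set(int v) {
    value = v;
    notifyObservers();
  }
};

// Variation 1: rebuild the dependent data immediately on every notification.
struct EagerDouble : Observer {
  int doubled = 0;
  int rebuilds = 0;
  void notify(Subject &original) override {
    doubled = 2 * original.value;
    ++rebuilds;
  }
};

// Variation 2: only set a flag; recompute once per set of updates.
struct DeferredDouble : Observer {
  Subject *source = nullptr;
  bool stale = false;
  int doubled = 0;
  int rebuilds = 0;
  void notify(Subject &original) override {
    source = &original;
    stale = true;
  }
  // Called by whoever knows that a set of updates is complete.
  void flush() {
    if (!stale) return;
    assert(source != nullptr);
    doubled = 2 * source->value;
    ++rebuilds;
    stale = false;
  }
};
```

After three calls to set followed by a single flush, both observers hold the same value, but the eager one has rebuilt three times and the deferred one only once.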

I can think of one enhancement: building the dependent trees lazily. Each dependent node would then have a boolean flag that marks it as valid or invalid. Each accessor method would check this flag and, if necessary, recalculate the subtree. The difference from №2 is that recalculation happens on access, not upon change. This would probably result in the fewest computations, but can lead to significant difficulties if the type of a node would have to change upon access.
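A small sketch of the lazy variant, assuming a single dependent value in place of a whole subtree; Source, LazySquare and the recomputation counter are names invented for this example.

```cpp
#include <cassert>

// Hypothetical original node.
struct Source {
  int value = 0;
};

// Hypothetical dependent node: caches the square of the source value.
class LazySquare {
public:
  explicit LazySquare(const Source *source) : source_(source) {
    assert(source_ != nullptr);
  }

  // Called when the source changes: just invalidate, do no work.
  void notify() { valid_ = false; }

  // Accessor: recalculates only if the cached result is invalid.
  int get() {
    if (!valid_) {
      cached_ = source_->value * source_->value;
      ++recomputations_;
      valid_ = true;
    }
    return cached_;
  }

  int recomputations() const { return recomputations_; }

private:
  const Source *source_;
  bool valid_ = false;  // starts invalid, so the first access computes
  int cached_ = 0;
  int recomputations_ = 0;
};
```

Any number of changes between two reads costs one recalculation, and a dependent value that is never read is never computed at all.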

I would also like to challenge the need for multiple dependent trees. For example, I always structure my parsers in a way that they immediately emit an AST. Information that is only relevant during construction of this tree doesn't have to be stored in any permanent data structure. Likewise, you can also choose your objects in such a way that the AST has an interpretation as a control flow graph.

For a real-life example, the compiler part inside the perl interpreter does this: The AST is built bottom-up, during which some nodes are constant-folded away. In a second run, the nodes are connected in execution order, during which some nodes are skipped by optimizations. The result is very fast parsing (and few allocations), but very limited optimizations. It should be noted that while such a design is possible, it is probably not something you should strive for: It is a ~~calculated trade-off~~ complete violation of the Single-Responsibility Principle.

If you do actually need multiple trees, then you should also consider whether they really have to be built simultaneously. In the majority of cases, a parse tree is constant after the parse. Likewise, an AST will probably stay constant after macros are resolved and AST-level optimizations have been executed.
