AST Representation – Homogeneous vs. Heterogeneous AST Representation

compilerdata structurestrees

What are the reasons to choose a homogeneous vs. a heterogeneous AST representation for implementing a complex domain-specific programming language?

Just to be very clear about what I'm asking, here is some extra background:

By homogeneous, I mean a tree constructed of nodes which are a single generic type. For example, I think this question is really language independent, but using a C++-like struct for illustration, I'd consider this a minimal homogeneous abstract syntax tree node:

struct Node {
  int tag;
  void *data;

  Node *first_child;
  Node *next_sibling;
};

By heterogeneous, I mean a tree constructed of nodes which are multiple individual types (e.g. one for each grammar production). For example, I don't want to assume a particular language, but again using C++-like structs for illustration, I'd consider these types part of a hierarchy used to build a heterogenous abstract syntax tree tree:

struct Node {};

struct Integer_Node : Node {
  int value;
};

struct Plus_Node : Node {
  Node *right;
  Node *left;
};

struct If_Statement : Node {
  Node *Condition;
  Node *Then_Expression;
  Node *Else_Expression;  
};

// ... more types, depending on the language ...

Over the years, I've implemented several small, special-purpose compilers, usually in a very ad-hoc way. I've never used much of a real "AST" because usually syntax-direct translation has been good enough.

Now I'm in the process of designing and implemented a new, much more complex language, where I will be building an AST and then walking over it with multiple passes for verification, semantic analysis, and so forth.

For example, it seems that using a homogeneous scheme reduces the amount of code up front, but I wonder if a heterogeneous scheme will pay off better in the long run for reasons I'm not considering. On the other hand, the heterogeneous scheme seems like it allows benefiting from the compiler's static type checking, virtual method dispatch, etc, but I wonder if any of that is really very useful when developing semantic passes and so forth.

Basically, I'm hoping to gain some insight from those who may have some real experience here. I've read many compiler books and have a moderate amount of basic compiler-writing experience, but I haven't seen this particular dichotomy addressed in any literature I can get my hands on.

Best Answer

For me, the big advantage of the heterogeneous AST is that it forms a kind of forced, annotated switch statement (assuming a C-like language).

For the homogeneous AST you usually end up with some kind of routine or class with a big switch statement. You need to keep track of which child node is what yourself. "First child is the conditional, second the true-block, third the false-block." Whenever you change the code, you easily find yourself making a mental picture of your DSL syntax over and over again.

Of course you can document heavily, but a good program ought to be self-documenting as much as possible. The heterogeneous AST does just that.

Furthermore, you can easily turn a heterogeneous AST into a homogeneous one, but not the other way around. Add the tag info (which is a good idea, unless your language supports a cheap is-a query). You can add Node(int index) methods to return the named fields. So you lose nothing in generality by using the heterogeneous AST.

I won't mention that the heterogeneous AST is ideal for the Visitor pattern, as it is just as easy to use the Strategy pattern with the homogeneous switch routine. It is easier to add specific functionality to the heterogeneous AST itself, though. If you want to turn it into an interpreter, all you need to do is add some kind of "eval" methods.

I would consider a homogeneous AST if there are limiting circumstances. If you need to port the compiler to a system with no OOP language available, or if you need to optimize for speed. The homogeneous AST is easier to combine with an FSM. The latter can also be an advantage if you want to have a general multi-purpose compiler that loads syntax rules on the fly. But it is easier to start out with a heterogeneous AST that will generate those tables, after the compiler has been thoroughly tested.

So, all in all, I would say that neither tree offers specific advantages in terms of "does this tree help or hinder in, say, 'semantic passes'?" The advantage of the heterogeneous AST is, in my experience, to reduce the amount of thought and concentration you have to put into coding the tedious stuff of the compiler. There is a lot of repetitiveness and bookkeeping going on, so let the computer do the work for you as much as possible, is my motto.