For me, the big advantage of the heterogeneous AST is that it forms a kind of forced, annotated switch
statement (assuming a C-like language).
For the homogeneous AST you usually end up with some kind of routine or class with a big switch
statement. You need to keep track of which child node is what yourself. "First child is the conditional, second the true-block, third the false-block." Whenever you change the code, you easily find yourself making a mental picture of your DSL syntax over and over again.
Of course you can document heavily, but a good program ought to be self-documenting as much as possible. The heterogeneous AST does just that.
Furthermore, you can easily turn a heterogeneous AST into a homogeneous one, but not the other way around. Add the tag info (which is a good idea, unless your language supports a cheap is-a
query). You can add Node(int index)
methods to return the named fields. So you lose nothing in generality by using the heterogeneous AST.
I won't mention that the heterogeneous AST is ideal for the Visitor pattern, as it is just as easy to use the Strategy pattern with the homogeneous switch
routine. It is easier to add specific functionality to the heterogeneous AST itself, though. If you want to turn it into an interpreter, all you need to do is add some kind of "eval" methods.
I would consider a homogeneous AST if there are limiting circumstances. If you need to port the compiler to a system with no OOP language available, or if you need to optimize for speed. The homogeneous AST is easier to combine with an FSM. The latter can also be an advantage if you want to have a general multi-purpose compiler that loads syntax rules on the fly. But it is easier to start out with a heterogeneous AST that will generate those tables, after the compiler has been thoroughly tested.
So, all in all, I would say that neither tree offers specific advantages in terms of "does this tree help or hinder in, say, 'semantic passes'?" The advantage of the heterogeneous AST is, in my experience, to reduce the amount of thought and concentration you have to put into coding the tedious stuff of the compiler. There is a lot of repetitiveness and bookkeeping going on, so let the computer do the work for you as much as possible, is my motto.
Recursion is the answer, but you descend into each subtree before handling the operation:
int a = 1 + 2 - 3 * 4 - 5
to tree form:
(assign (a) (sub (sub (add (1) (2)) (mul (3) (4))) (5))
Inferring the type happens by first walking the left hand side, then the right hand side, and then handling the operator as soon as the operands' types are inferred:
(assign*(a) (sub (sub (add (1) (2)) (mul (3) (4))) (5))
-> descend into lhs
(assign (a*) (sub (sub (add (1) (2)) (mul (3) (4))) (5))
-> infer a
. a
is known to be int
. We're back in the assign
node now:
(assign (int:a)*(sub (sub (add (1) (2)) (mul (3) (4))) (5))
-> descend into rhs, then into the lhs of the inner operators until we hit something interesting
(assign (int:a) (sub*(sub (add (1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub*(add (1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub (add*(1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub (add (1*) (2)) (mul (3) (4))) (5))
-> infer the type of 1
, which is int
, and return to the parent
(assign (int:a) (sub (sub (add (int:1)*(2)) (mul (3) (4))) (5))
-> go into the rhs
(assign (int:a) (sub (sub (add (int:1) (2*)) (mul (3) (4))) (5))
-> infer the type of 2
, which is int
, and return to the parent
(assign (int:a) (sub (sub (add (int:1) (int:2)*) (mul (3) (4))) (5))
-> infer the type of add(int, int)
, which is int
, and return to the parent
(assign (int:a) (sub (sub (int:add (int:1) (int:2))*(mul (3) (4))) (5))
-> descend into the rhs
(assign (int:a) (sub (sub (int:add (int:1) (int:2)) (mul*(3) (4))) (5))
etc., until you end up with
(assign (int:a) (int:sub (int:sub (int:add (int:1) (int:2)) (int:mul (int:3) (int:4))) (int:5))*
Whether the assignment itself is also an expression with a type depends on your language.
The important takeaway: to determine the type of any operator node in the tree, you only have to look at its immediate children, which need to have a type assigned to them already.
Best Answer
I invite you to read the source code of a modern compiler if you wish to know how a modern compiler works. I'd start with the Roslyn project, which implements C# and Visual Basic compilers.
In particular I draw your attention to the code in the C# compiler which implements type symbols:
https://github.com/dotnet/roslyn/tree/master/src/Compilers/CSharp/Portable/Symbols
and you might also want to look at the code for convertibility rules. There is much there that pertains to the algebraic manipulation of generic types.
https://github.com/dotnet/roslyn/tree/master/src/Compilers/CSharp/Portable/Binder/Semantics/Conversions
I tried hard to make the latter easy to read.
You are describing templates, not generics. C# and Visual Basic have actual generics in their type systems.
Briefly, they work like this.
We begin by establishing rules for what formally constitutes a type at compile time. For example:
int
is a type, a type parameterT
is a type, for any typeX
, the array typeX[]
is also a type, and so on.The rules for generics involve substitution. For example,
class C with one type parameter
is not a type. It's a pattern for making types.class C with one type parameter called T, under substitution with int for T
is a type.Rules describing the relationships between types -- compatibility upon assignment, how to determine the type of an expression, and so on -- are designed and implemented in the compiler.
A bytecode language that supports generic types in its metadata system is designed and implemented.
At runtime the JIT compiler turns the bytecode into machine code; it is responsible for constructing the appropriate machine code given a generic specialization.
So for example, in C# when you say
then the compiler verifies that in
C<int>
, the argumentint
is a valid substitution forT
, and generates metadata and bytecode accordingly. At runtime, the jitter detects that aC<int>
is being created for the first time and generates the appropriate machine code dynamically.