Compiler Design – How Are Generics Implemented in a Modern Compiler?

compilerdesign-patternsprogramming-languages

What I mean here is how do we go from some template T add(T a, T b) ... into the generated code? I've thought of a few ways to achieve this, we store the generic function in an AST as Function_Node and then each time we use it we store in the original function node a copy of itself with all of the types T substituted with the types that are being used. For example add<int>(5, 6) will store a copy of the generic function for add and substitute all types T in the copy with int.

So it would look something like:

struct Function_Node {
    std::string name; // etc.
    Type return_type;
    std::vector<std::pair<Type, std::string>> arguments;
    std::vector<Function_Node> copies;
};

Then you could generate code for these and when you visit a Function_Node where the list of copies copies.size() > 0, you invoke visitFunction on all of the copies.

visitFunction(Function_Node& node) {
    if (node.copies.size() > 0) {
        for (auto& node : nodes.copies) {
            visitFunction(node);
        }
        // it's a generic function so we don't want
        // to emit code for this.
        return;
    }
}

Would this work out well? How do modern compilers approach this problem? I think perhaps another way to do this would be you could inject the copies into the AST so that it runs through all of the semantic phases. I also thought perhaps you could generate them in an immediate form like Rust's MIR or Swifts SIL for example.

My code is written in Java, the examples here are C++ because it's a bit less verbose for examples – but the principle is basically the same thing. Though there may be a few errors because it's just written out by hand in the question box.

Note that I mean modern compiler as in what is the best way to approach this problem. And when I say generics I don't mean like Java generics where they use type erasure.

Best Answer

How are generics implemented in a modern compiler?

I invite you to read the source code of a modern compiler if you wish to know how a modern compiler works. I'd start with the Roslyn project, which implements C# and Visual Basic compilers.

In particular I draw your attention to the code in the C# compiler which implements type symbols:

https://github.com/dotnet/roslyn/tree/master/src/Compilers/CSharp/Portable/Symbols

and you might also want to look at the code for convertibility rules. There is much there that pertains to the algebraic manipulation of generic types.

https://github.com/dotnet/roslyn/tree/master/src/Compilers/CSharp/Portable/Binder/Semantics/Conversions

I tried hard to make the latter easy to read.

I've thought of a few ways to achieve this, we store the generic function in an AST as Function_Node and then each time we use it we store in the original function node a copy of itself with all of the types T substituted with the types that are being used.

You are describing templates, not generics. C# and Visual Basic have actual generics in their type systems.

Briefly, they work like this.

We begin by establishing rules for what formally constitutes a type at compile time. For example: int is a type, a type parameter T is a type, for any type X, the array type X[] is also a type, and so on.
The rules for generics involve substitution. For example, class C with one type parameter is not a type. It's a pattern for making types. class C with one type parameter called T, under substitution with int for T is a type.
Rules describing the relationships between types -- compatibility upon assignment, how to determine the type of an expression, and so on -- are designed and implemented in the compiler.
A bytecode language that supports generic types in its metadata system is designed and implemented.
At runtime the JIT compiler turns the bytecode into machine code; it is responsible for constructing the appropriate machine code given a generic specialization.

So for example, in C# when you say

class C<T> { public void X(T t) { Console.WriteLine(t); } }
...
var c = new C<int>(); 
c.X(123);

then the compiler verifies that in C<int>, the argument int is a valid substitution for T, and generates metadata and bytecode accordingly. At runtime, the jitter detects that a C<int> is being created for the first time and generates the appropriate machine code dynamically.

Related Solutions

AST Representation – Homogeneous vs. Heterogeneous AST Representation

For me, the big advantage of the heterogeneous AST is that it forms a kind of forced, annotated switch statement (assuming a C-like language).

For the homogeneous AST you usually end up with some kind of routine or class with a big switch statement. You need to keep track of which child node is what yourself. "First child is the conditional, second the true-block, third the false-block." Whenever you change the code, you easily find yourself making a mental picture of your DSL syntax over and over again.

Of course you can document heavily, but a good program ought to be self-documenting as much as possible. The heterogeneous AST does just that.

Furthermore, you can easily turn a heterogeneous AST into a homogeneous one, but not the other way around. Add the tag info (which is a good idea, unless your language supports a cheap is-a query). You can add Node(int index) methods to return the named fields. So you lose nothing in generality by using the heterogeneous AST.

I won't mention that the heterogeneous AST is ideal for the Visitor pattern, as it is just as easy to use the Strategy pattern with the homogeneous switch routine. It is easier to add specific functionality to the heterogeneous AST itself, though. If you want to turn it into an interpreter, all you need to do is add some kind of "eval" methods.

I would consider a homogeneous AST if there are limiting circumstances. If you need to port the compiler to a system with no OOP language available, or if you need to optimize for speed. The homogeneous AST is easier to combine with an FSM. The latter can also be an advantage if you want to have a general multi-purpose compiler that loads syntax rules on the fly. But it is easier to start out with a heterogeneous AST that will generate those tables, after the compiler has been thoroughly tested.

So, all in all, I would say that neither tree offers specific advantages in terms of "does this tree help or hinder in, say, 'semantic passes'?" The advantage of the heterogeneous AST is, in my experience, to reduce the amount of thought and concentration you have to put into coding the tedious stuff of the compiler. There is a lot of repetitiveness and bookkeeping going on, so let the computer do the work for you as much as possible, is my motto.

Compiler – Procedure for Statically Type Checking Complex Expressions

Recursion is the answer, but you descend into each subtree before handling the operation:

int a = 1 + 2 - 3 * 4 - 5

to tree form:

(assign (a) (sub (sub (add (1) (2)) (mul (3) (4))) (5))

Inferring the type happens by first walking the left hand side, then the right hand side, and then handling the operator as soon as the operands' types are inferred:

(assign*(a) (sub (sub (add (1) (2)) (mul (3) (4))) (5))

-> descend into lhs

(assign (a*) (sub (sub (add (1) (2)) (mul (3) (4))) (5))

-> infer a. a is known to be int. We're back in the assign node now:

(assign (int:a)*(sub (sub (add (1) (2)) (mul (3) (4))) (5))

-> descend into rhs, then into the lhs of the inner operators until we hit something interesting

(assign (int:a) (sub*(sub (add (1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub*(add (1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub (add*(1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub (add (1*) (2)) (mul (3) (4))) (5))

-> infer the type of 1, which is int, and return to the parent

(assign (int:a) (sub (sub (add (int:1)*(2)) (mul (3) (4))) (5))

-> go into the rhs

(assign (int:a) (sub (sub (add (int:1) (2*)) (mul (3) (4))) (5))

-> infer the type of 2, which is int, and return to the parent

(assign (int:a) (sub (sub (add (int:1) (int:2)*) (mul (3) (4))) (5))

-> infer the type of add(int, int), which is int, and return to the parent

(assign (int:a) (sub (sub (int:add (int:1) (int:2))*(mul (3) (4))) (5))

-> descend into the rhs

(assign (int:a) (sub (sub (int:add (int:1) (int:2)) (mul*(3) (4))) (5))

etc., until you end up with

(assign (int:a) (int:sub (int:sub (int:add (int:1) (int:2)) (int:mul (int:3) (int:4))) (int:5))*

Whether the assignment itself is also an expression with a type depends on your language.

The important takeaway: to determine the type of any operator node in the tree, you only have to look at its immediate children, which need to have a type assigned to them already.

Best Answer

Related Solutions

AST Representation – Homogeneous vs. Heterogeneous AST Representation

Compiler – Procedure for Statically Type Checking Complex Expressions

Related Topic