Compiler – Procedure for Statically Type Checking Complex Expressions

compilerlanguage-agnostictypetype-systems

Note: When I used "complex" in the title, I mean that the expression has many operators and operands. Not that the expression itself is complex.


I've recently been working on a simple compiler to x86-64 assembly. I've finished the compiler's main front end – the lexer and parser – and am now able to generate an Abstract Syntax Tree representation of my program. And since my language will be statically typed, I am now doing the next phase: type checking the source code. However, I've come to a problem and have not been able to reasonably solve it myself.

Consider the following example:

My compiler's parser has read this line of code:

int a = 1 + 2 - 3 * 4 - 5

And converted it to the following AST:

       =
     /   \
  a(int)  \
           -
         /   \
        -     5
      /   \
     +     *
    / \   / \
   1   2 3   4

Now it must type check the AST. it starts by first type checking the = operator. It first checks the left hand side of the operator. It sees that the variable a is declared as an integer. So it must now verify that the right hand side expression evaluates to an integer.

I understand how this could be done if the expression was just a single value, such as 1 or 'a'. But how would this be done for expressions with multiple values and operands – a complex expression – such as the one above? To correctly determine the value of the expression, it seems like the type checker would actual have to execute the expression itself and record the result. But this obviously seems to defeat the purpose of separating the compilation and execution phases.

The only other way I imagine this could be done is to recursively check the leaf of each subexpression in the AST and verify all of the leaf's types match the expected operator type. So starting with the = operator, the type checker would then scan all of the left hand side's AST and verify that the leafs are all integers. It would then repeat this for each operator in the subexpression.

I've tried researching the topic in my copy of "The Dragon Book", but it doesn't seem go in to much detail, and simply reiterates what I already know.

What is the usual method used when a compiler is type checking expressions with many operators and operands? Are any of the methods I mentioned above used? If not, what are the methods and how exactly would they work?

Best Answer

Recursion is the answer, but you descend into each subtree before handling the operation:

int a = 1 + 2 - 3 * 4 - 5

to tree form:

(assign (a) (sub (sub (add (1) (2)) (mul (3) (4))) (5))

Inferring the type happens by first walking the left hand side, then the right hand side, and then handling the operator as soon as the operands' types are inferred:

(assign*(a) (sub (sub (add (1) (2)) (mul (3) (4))) (5))

-> descend into lhs

(assign (a*) (sub (sub (add (1) (2)) (mul (3) (4))) (5))

-> infer a. a is known to be int. We're back in the assign node now:

(assign (int:a)*(sub (sub (add (1) (2)) (mul (3) (4))) (5))

-> descend into rhs, then into the lhs of the inner operators until we hit something interesting

(assign (int:a) (sub*(sub (add (1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub*(add (1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub (add*(1) (2)) (mul (3) (4))) (5))
(assign (int:a) (sub (sub (add (1*) (2)) (mul (3) (4))) (5))

-> infer the type of 1, which is int, and return to the parent

(assign (int:a) (sub (sub (add (int:1)*(2)) (mul (3) (4))) (5))

-> go into the rhs

(assign (int:a) (sub (sub (add (int:1) (2*)) (mul (3) (4))) (5))

-> infer the type of 2, which is int, and return to the parent

(assign (int:a) (sub (sub (add (int:1) (int:2)*) (mul (3) (4))) (5))

-> infer the type of add(int, int), which is int, and return to the parent

(assign (int:a) (sub (sub (int:add (int:1) (int:2))*(mul (3) (4))) (5))

-> descend into the rhs

(assign (int:a) (sub (sub (int:add (int:1) (int:2)) (mul*(3) (4))) (5))

etc., until you end up with

(assign (int:a) (int:sub (int:sub (int:add (int:1) (int:2)) (int:mul (int:3) (int:4))) (int:5))*

Whether the assignment itself is also an expression with a type depends on your language.

The important takeaway: to determine the type of any operator node in the tree, you only have to look at its immediate children, which need to have a type assigned to them already.