Compiler – Storing Individual Lines of Code and Functions in a Concrete Syntax Tree

I'm trying to write a simple compiler for learning purposes. I've been reading the Dragon Book and Modern Compiler Design and one part I don't understand is how the Concrete Syntax Tree is actually created and stored.

I understand that by looping through the Tokens produced by the Lexer it's a simple matter to collect all the pieces of an assignment operator; for example:

int i = 0;

Here it is pretty straightforward to collect the type, the identifier, and the fact that we're assigning a const_number value of zero. And I understand what this concrete syntax tree looks like.

And if it's assigned an expression, like:

int i = a * b;

I also understand what this concrete syntax tree would look like.

But then let's say I have:

int i = functionCall();

What does this look like in a concrete syntax tree?

And further, consider a language like C, which is a bunch of functions, one of which (main) is designated as the entry point. How does all of this fit into a concrete syntax tree?

Does each one have its own tree?

Creating a hierarchy of Node types for my tree, each with the specific components it needs, makes sense to me; what I don't see is how function calls fit in, unless every single function were inlined.

Additional Info from Comments

So, say I have some code that looks like:

int AddProc(int i, int j)
{
    return i + j;
}

void main()
{
    int x = 8;
    int y = 0;
    int z = x + y;
    x = AddProc(y,z);
}

The token stream runs from top to bottom; simple. Each token tells the parser whether it's a TYPE, ID, CONST, ADD_OP, or whatever. The first stage of the parser produces a Concrete Syntax Tree, which is then turned into an Abstract Syntax Tree.

My question is what does the Concrete Syntax Tree look like for the above; and further, the AST as well?

Best Answer

Concrete syntax trees follow directly from the grammatical production rules of the language (i.e. its grammar). They are complex, wordy and offer little benefit for the next phases of the compiler (analysis and code generation).

I don't think most compilers represent or store concrete syntax, let alone concrete syntax trees. Concrete syntax is at best manifest within the parsing algorithm itself (for example, in the call stack of a recursive parser). Generally, on successfully parsing something, the parser generates some intermediate data structure, and if that is a tree, it more likely reflects an abstract syntax tree.

Look at the diagram for the "Parse Tree" in this answer: https://stackoverflow.com/a/10176731/471129 and compare it with the AST further on in @Guy's answer.

http://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees/
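
To make the question's AddProc/main example concrete, here is a rough sketch of one way its AST could be stored, written in Python. Every class and field name below is hypothetical and chosen only for illustration. The points it tries to show are that the whole source file can live in a single tree whose root holds the function definitions side by side, and that a call is just another expression node carrying the callee's name and its argument expressions, so nothing has to be inlined.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical node types; the names are illustrative only.

@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"

@dataclass
class Call:
    callee: str          # only the name; the callee's body is stored elsewhere
    args: List["Expr"]

Expr = Union[Num, Var, BinOp, Call]

@dataclass
class VarDecl:
    type: str
    name: str
    init: Expr

@dataclass
class Assign:
    target: str
    value: Expr

@dataclass
class Return:
    value: Expr

@dataclass
class Param:
    type: str
    name: str

@dataclass
class FuncDef:
    return_type: str
    name: str
    params: List[Param]
    body: list           # statements: VarDecl, Assign, Return

@dataclass
class Program:
    functions: List[FuncDef]   # one tree; each function is a child of the root

program = Program([
    FuncDef("int", "AddProc", [Param("int", "i"), Param("int", "j")], [
        Return(BinOp("+", Var("i"), Var("j"))),
    ]),
    FuncDef("void", "main", [], [
        VarDecl("int", "x", Num(8)),
        VarDecl("int", "y", Num(0)),
        VarDecl("int", "z", BinOp("+", Var("x"), Var("y"))),
        Assign("x", Call("AddProc", [Var("y"), Var("z")])),
    ]),
])

# The last statement of main is an assignment whose value is a Call node.
print(program.functions[1].body[3])
```

A later phase would look up "AddProc" in a symbol table to check the argument types and emit a call instruction; the tree itself only records that a call happens here.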

Related Solutions

Data Structures – How Exactly Is an Abstract Syntax Tree Created?

The short answer is that you use stacks. This is a good example, but I'll apply it to an AST.

FYI, this is Edsger Dijkstra's Shunting-Yard Algorithm.

In this case, I will use an operator stack and an expression stack. Since numbers are considered expressions in most languages, I'll use the expression stack to store them.

class ExprNode:
    char c
    ExprNode operand1
    ExprNode operand2

    ExprNode(char num):
        c = num
        operand1 = operand2 = nil

    ExprNode(char op, ExprNode e1, ExprNode e2):
        c = op
        operand1 = e1
        operand2 = e2

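# Parser
ExprNode parse(string input):
    Stack operatorStack    # pending operators and '('
    Stack exprStack        # completed sub-expressions
    char c
    while (c = input.getNextChar()):
        if (c.isWhitespace()):
            continue

        else if (c == '('):
            operatorStack.push(c)

        else if (c.isDigit()):
            exprStack.push(ExprNode(c))

        else if (c.isOperator()):
            # Reduce pending operators of equal or higher precedence,
            # but never look past an open parenthesis or an empty stack.
            while (!operatorStack.isEmpty() and operatorStack.top() != '('
                   and operatorStack.top().precedence >= c.precedence):
                operator = operatorStack.pop()
                # Careful! The second operand was pushed last.
                e2 = exprStack.pop()
                e1 = exprStack.pop()
                exprStack.push(ExprNode(operator, e1, e2))

            operatorStack.push(c)

        else if (c == ')'):
            while (operatorStack.top() != '('):
                operator = operatorStack.pop()
                # Careful! The second operand was pushed last.
                e2 = exprStack.pop()
                e1 = exprStack.pop()
                exprStack.push(ExprNode(operator, e1, e2))

            # Pop the '(' off the operator stack.
            operatorStack.pop()

        else:
            error()
            return nil

    # End of input: apply whatever operators are still pending.
    while (!operatorStack.isEmpty()):
        operator = operatorStack.pop()
        e2 = exprStack.pop()
        e1 = exprStack.pop()
        exprStack.push(ExprNode(operator, e1, e2))

    # There should only be one item on exprStack.
    # It's the root node, so we return it.
    return exprStack.pop()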

(Please be nice about my code. I know it's not robust; it's just supposed to be pseudocode.)

Anyway, as you can see from the code, arbitrary expressions can be operands to other expressions. If you have the following input:

5 * 3 + (4 + 2 % 2 * 8)

the code I wrote would produce this AST:

     +
    / \
   /   \
  *     +
 / \   / \
5   3 4   *
         / \
        %   8
       / \
      2   2
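
For anyone who wants to run this, the following is a small Python sketch of the same two-stack idea. The precedence table is an assumption (* / % above + -, all left-associative), the show helper is made up for printing, and there is no error handling for mismatched parentheses or malformed input.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExprNode:
    c: str
    operand1: Optional["ExprNode"] = None
    operand2: Optional["ExprNode"] = None

# Assumed precedence table: higher number binds tighter.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "%": 2}

def parse(text: str) -> ExprNode:
    operators = []   # pending operators and '('
    exprs = []       # completed sub-expressions

    def reduce_top():
        op = operators.pop()
        e2 = exprs.pop()   # the second operand was pushed last
        e1 = exprs.pop()
        exprs.append(ExprNode(op, e1, e2))

    for c in text:
        if c.isspace():
            continue
        if c == "(":
            operators.append(c)
        elif c.isdigit():
            exprs.append(ExprNode(c))   # single-digit numbers only
        elif c in PRECEDENCE:
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[c]):
                reduce_top()
            operators.append(c)
        elif c == ")":
            while operators[-1] != "(":
                reduce_top()
            operators.pop()   # discard the '('
        else:
            raise SyntaxError(f"unexpected character {c!r}")

    # End of input: apply the operators that are still pending.
    while operators:
        reduce_top()
    return exprs.pop()

def show(node: ExprNode) -> str:
    """Render the tree with explicit parentheses (illustrative helper)."""
    if node.operand1 is None:
        return node.c
    return f"({show(node.operand1)} {node.c} {show(node.operand2)})"

print(show(parse("5 * 3 + (4 + 2 % 2 * 8)")))
# prints ((5 * 3) + (4 + ((2 % 2) * 8)))
```

The fully parenthesized output has the same shape as the tree drawn above.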

And then when you want to produce the code for that AST, you do a Post Order Tree Traversal. When you visit a leaf node (with a number), you generate a constant because the compiler needs to know the operand values. When you visit a node with an operator, you generate the appropriate instruction from the operator. For example, the '+' operator gives you an "add" instruction.
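
As a rough illustration of that traversal, here is a Python sketch that walks a tree and emits instructions for an imaginary stack machine. The instruction names (push, add, mul and so on) are invented for this example and do not belong to any real instruction set.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ExprNode:
    c: str
    operand1: Optional["ExprNode"] = None
    operand2: Optional["ExprNode"] = None

# Hypothetical opcodes for an imaginary stack machine.
OPCODES = {"+": "add", "-": "sub", "*": "mul", "/": "div", "%": "mod"}

def generate(node: ExprNode, out: List[str]) -> List[str]:
    """Post-order walk: left operand, right operand, then the operator."""
    if node.operand1 is None:
        # Leaf: a number, so emit a constant.
        out.append(f"push {node.c}")
        return out
    generate(node.operand1, out)
    generate(node.operand2, out)
    out.append(OPCODES[node.c])
    return out

# Tree for 5 * 3 + 2
tree = ExprNode("+", ExprNode("*", ExprNode("5"), ExprNode("3")), ExprNode("2"))
for instruction in generate(tree, []):
    print(instruction)
# prints:
# push 5
# push 3
# mul
# push 2
# add
```

Because both operands are visited before their operator, each instruction finds its inputs already on the stack.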

Compiler – In Which Process Does Syntax Error Occur? (Tokenizing or Parsing)

A tokenizer is just a parser optimization. It's perfectly possible to implement a parser without a tokenizer.

A tokenizer (or lexer, or scanner) chops the input into a list of tokens. Some parts of the string (comments, whitespace) are usually ignored. Each token has a type (the meaning of this string in the language) and a value (the string that makes up the token). For example, the PHP source snippet

$a + $b

could be represented by the tokens

Variable('$a'),
Plus('+'),
Variable('$b')

The tokenizer does not consider whether a token is possible in this context. For example, the input

$a $b + +

would happily produce the token stream

Variable('$a'),
Variable('$b'),
Plus('+'),
Plus('+')

When the parser then consumes these tokens, it will notice that two variables cannot follow each other, and neither can two infix operators. (Note that other languages have different syntaxes where such a token stream may be legal, but PHP does not.)
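
A minimal sketch of that parser-side check, in Python. It accepts only a toy grammar (a variable followed by any number of "+ variable" pairs), which is an assumption for illustration and nowhere near PHP's real grammar. Tokens are plain (type, value) tuples, and ParseError is a made-up exception name.

```python
class ParseError(Exception):
    pass

def parse_expression(tokens):
    """Accept: Variable (Plus Variable)*. Returns a nested tuple tree."""
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens):
            raise ParseError(f"expected {kind}, found end of input")
        found_kind, value = tokens[pos]
        if found_kind != kind:
            raise ParseError(
                f"expected {kind}, found {found_kind}({value!r}) at token {pos}")
        pos += 1
        return value

    tree = ("var", expect("Variable"))
    while pos < len(tokens) and tokens[pos][0] == "Plus":
        expect("Plus")
        tree = ("+", tree, ("var", expect("Variable")))

    # Anything left over cannot continue the expression.
    if pos < len(tokens):
        kind, value = tokens[pos]
        raise ParseError(f"unexpected {kind}({value!r}) at token {pos}")
    return tree

good = [("Variable", "$a"), ("Plus", "+"), ("Variable", "$b")]
bad = [("Variable", "$a"), ("Variable", "$b"), ("Plus", "+"), ("Plus", "+")]

print(parse_expression(good))
try:
    parse_expression(bad)
except ParseError as err:
    print("syntax error:", err)
```

Both token lists are valid as far as a tokenizer is concerned; only the second is rejected, and only once the parser looks at the order of the tokens.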

A parser may still fail at the tokenizer stage. For example, there might be an illegal character:

$a × ½ — 3

A PHP tokenizer would be unable to match this input to its rules, and would produce an error before the main parsing starts.
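
The tokenizer side can be sketched in Python with the standard re module. The rules below are a deliberately tiny, assumed subset (variables, integers, plus, whitespace) and not PHP's actual lexical grammar; TokenizeError and the rule names are invented for the example.

```python
import re

class TokenizeError(Exception):
    pass

# Each rule is a regular expression; together they form one alternation.
TOKEN_RULES = [
    ("Variable", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("Number", r"[0-9]+"),
    ("Plus", r"\+"),
    ("Whitespace", r"\s+"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # No rule matches here: this is a lexical error.
            raise TokenizeError(
                f"illegal character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "Whitespace":   # whitespace is ignored
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

# Nonsense order, but every piece is a valid token, so no error here.
print(tokenize("$a $b + +"))

try:
    tokenize("$a × ½ — 3")
except TokenizeError as err:
    print("tokenizer error:", err)
```

The first input tokenizes without complaint even though no parser would accept it; the second fails before any parsing starts because no rule matches the character after the first variable.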

More formally, tokenizers are used when each token can be described as a regular language. The tokens can then be matched extremely efficiently, possibly by an implementation based on a DFA. In contrast, the main grammar is usually context-free and requires a more complicated, less performant parsing algorithm such as LALR.
