Data Structures – How Exactly Is an Abstract Syntax Tree Created?

compilerdata structuresparsingtrees

I think I understand the goal of an AST, and I've built a couple of tree structures before, but never an AST. I'm mostly confused because the nodes are text and not number, so I can't think of a nice way to input a token/string as I'm parsing some code.

For example, when I looked at diagrams of AST's, the variable and its value were leaf nodes to an equal sign. This makes perfect sense to me, but how would I go about implementing this? I guess I can do it case by case, so that when I stumble upon an "=" I use that as a node, and add the value parsed before the "=" as the leaf. It just seems wrong, because I'd probably have to make cases for tons and tons of things, depending on the syntax.

And then I came upon another problem, how is the tree traversed? Do I go all the way down the height, and go back up a node when I hit the bottom, and do the same for it's neighbor?

I've seen tons of diagrams on ASTs, but I couldn't find a fairly simple example of one in code, which would probably help.

Best Answer

The short answer is that you use stacks. This is a good example, but I'll apply it to an AST.

FYI, this is Edsger Dijkstra's Shunting-Yard Algorithm.

In this case, I will use an operator stack and an expression stack. Since numbers are considered expressions in most languages, I'll use the expression stack to store them.

class ExprNode:
    char c
    ExprNode operand1
    ExprNode operand2

    ExprNode(char num):
        c = num
        operand1 = operand2 = nil

    Expr(char op, ExprNode e1, ExprNode e2):
        c = op
        operand1 = e1
        operand2 = e2

# Parser
ExprNode parse(string input):
    char c
    while (c = input.getNextChar()):
        if (c == '('):
            operatorStack.push(c)

        else if (c.isDigit()):
            exprStack.push(ExprNode(c))

        else if (c.isOperator()):
            while(operatorStack.top().precedence >= c.precedence):
                operator = operatorStack.pop()
                # Careful! The second operand was pushed last.
                e2 = exprStack.pop()
                e1 = exprStack.pop()
                exprStack.push(ExprNode(operator, e1, e2))

            operatorStack.push(c)

        else if (c == ')'):
            while (operatorStack.top() != '('):
                operator = operatorStack.pop()
                # Careful! The second operand was pushed last.
                e2 = exprStack.pop()
                e1 = exprStack.pop()
                exprStack.push(ExprNode(operator, e1, e2))

            # Pop the '(' off the operator stack.
            operatorStack.pop()

        else:
            error()
            return nil

    # There should only be one item on exprStack.
    # It's the root node, so we return it.
    return exprStack.pop()

(Please be nice about my code. I know it's not robust; it's just supposed to be pseudocode.)

Anyway, as you can see from the code, arbitrary expressions can be operands to other expressions. If you have the following input:

5 * 3 + (4 + 2 % 2 * 8)

the code I wrote would produce this AST:

     +
    / \
   /   \
  *     +
 / \   / \
5   3 4   *
         / \
        %   8
       / \
      2   2

And then when you want to produce the code for that AST, you do a Post Order Tree Traversal. When you visit a leaf node (with a number), you generate a constant because the compiler needs to know the operand values. When you visit a node with an operator, you generate the appropriate instruction from the operator. For example, the '+' operator gives you an "add" instruction.