Scala – Right Way to Read from Token List to Implement Parse Tree

parsingscala

I'm trying to tackle the While Language problem on HackerRank using Scala. I've been given a set of grammar rules and have to implement the actual interpreter underlying it.

I've decided to approach this by building a lexer and parser in two separate stages. My lexer works perfectly: it correctly tags tokens and wraps them in a List of Tokens where Tokens is defined as follows:

case class Token(typeof: String, value: Any);

Right now, my problem is purely conceptual: how should I scan through my token list to implement a parse tree?

Note that I don't yet have any implementations for the actual nodes of my parse tree, as I want to solve this problem first (though I'm happy to hear suggestions for the same if it is indispensable to solving this).

Here is what I've done so far:

I've taken a look at the shunting-yard algorithm, but I'm not sure how to generalise it for anything beyond arithmetic operations. I thought of having multiple stacks designed for different node types, but I'm not convinced it's the right way to do it – this would easily become unmaintainable.
I've considered a naive linear traversal where I create an empty list, insert my tokens from my token list one by one, and go through this stack each time trying to match a grammar rule to it (if it does match, I remove the nodes from the list otherwise I leave it be), but I'm hesitant to do this because it seems a) inefficient and b) defeats the point of a separate lexing phase. I'm looking for something cleaner.

Beyond this, I'm stuck. I'm new to the parsing game and would like to understand what the right approach is.

HackerRank only supports Scala 2.11, which officially removed the parser-combinator library from the standard library, so I can't use that. I'm happy to employ any functional features that Scala brings, but I'm not interested in relying on something that only functional languages implement.

For the record, here is my Lexer code:

import scala.collection.mutable.HashMap

class Lexer {
    // defining keywords and tokens
    val parenthesis_mapping = HashMap("""(""" -> """)""","""{""" -> """}""");
    val keywords = List("if", "then", "else", "while", "do", "true", "false");
    val boolean_operators = List("and","or")
    val arithmetic_operators = List("""+""","""-""","""/""","""*""",""">""","""<""")
    val variables = """([a-z]+)""".r
    val numbers = """([0-9]*\.?[0-9]+)""".r

    // converts string token to corresponding Token type
    def convert(input : String): Token = {
        if (keywords.contains(input)) { return Token("KEYWORD",input) }
        else if (boolean_operators.contains(input)) { return Token("BOOLEAN",input) }
        else if (arithmetic_operators.contains(input)) { return Token("ARITHMETIC", input) }
        else if (parenthesis_mapping.keys.exists(_ == input)) { return Token("OPENING", input) }
        else if (parenthesis_mapping.values.exists(_ == input)) { return Token("CLOSING", input) }
        else { return input match {
            case variables(output) => Token("VARIABLE", output)
            case """:=""" => Token("ASSIGNMENT", null)
            case numbers(output) => Token("NUMBER", output.toInt)
            case """;""" => Token("BREAK", null)
            case _ => Token("NULL", input)
        } }
    }

   // splits input string and applies `convert` to each token
   def lex(strings: String): List[Token] = {...} // only providing signature here
  }

and here is the grammar I have to implement

Below is the description of grammar that we will use:

x, y ∈ Var (variables)

n ∈ Num (numerals/integers)

op_{a} ∈ Opa (arithmetic operators)
ob_{a} ::= + | - | * | /

op_{b} ∈ Opb (boolean operators)
op_{b} ::= and | or

op_{r} ∈ Opr (relational operators)
op_{r} ::= > | <

a ∈ AExp (arithmetic expressions)
a ::= x | n | a1 opa a2 | ( a )

b ∈ BExp (boolean expressions)
b ::= true | false | b1 opb b2 | a1 opr a2 | ( b )

S ∈ Stmt (statements)
S ::= x := a | S1 ; S2 | if b then { S1 } else { S2 } | while b do { S }

Here all operators are left associative. Their precedence order is as
follows.

Arithmetic Operators: (*, /) > (+, -) > (>, <)

Boolean Operators: and > or.

You can safely assume that all variables have integer type and are initialized properly. All variables name will consist of only
lowercase letter ('a'-'z') and it's length will not exceed 10.

Note that ";" is more like of a sequencing operator. It is used to
concatenate two statements. That's why there will be no ";" at the end
of block of statements.

An example of the resulting syntax:

fact := 1 ;
val := 10000 ;
cur := val ;
mod := 1000000007 ;

while ( cur > 1 )
  do
   {
      fact := fact * cur ;
      fact := fact - fact / mod * mod ;
      cur := cur - 1
   } ;

cur := 0

Best Answer

I would suggest you use your Lexer construct as more of a generator, rather than identifying all the tokens at once and sending an entire list to the parser. Your lexer could simply declare an interface with a .NextToken() or .Next() (and probably .Backup()) method and anytime the Parser requires a new token, it could simply call one of those methods.

If your language requires lookahead tokens, you could simply keep a list of lookahead tokens on the Parser and always lag N number of tokens behind the Lexer's current token.

As far as actual parsing techniques are concerned, you should probably consider recursive descent parsing. In my opinion, these are typically the easiest types of parsers to code by hand and are conceptually the simplest parsers to understand.

Recursive descent parsing typically works by matching one terminal grammar production to a function or method. So if you have a grammar production for a variable declaration like var x := 1, your method could look something like this:

// Grammar Production:
//   'var' Identifier ':=' Expression
private Declaration ParseDeclaration() {
    Declaration decl = new Declaration();

    if (!Expect(TokenType.Var)) {
        throw new ParseError("Expected 'var' keyword");
    }

    decl.Identifier = ParseIdentifier();

    if (!Expect(TokenType.AssignmentEquals)) {
        throw new ParseError("Expected token ':='");
    }

    decl.Expression = ParseExpression();
    return decl;
}

For each production in your grammar, you could define a method like that one and recursively construct your parse tree. It ends up being a really elegant system to code, rather than trying to match productions using token patterns. You can use the power of your programming language's call stack to control flow and handle repetition and parsing errors. Shunting yard is challenging to generalize to non expression grammar rules too, so I don't recommend you go down that path.

P.S. I'm sorry, I don't know Scala syntax offhand, so I opted to just use a Java/C# type of syntax.

Related Solutions

Python – Creating Tokens for a Lexer

Your question (as your final paragraph hints) is not really about the lexer, it is about the correct design of the interface between the lexer and the parser. As you might imagine there are many books about the design of lexers and parsers. I happen to like the parser book by Dick Grune, but it may not be a good introductory book. I happen to intensely dislike the C-based book by Appel, because the code is not usefully extensible into your own compiler (because of the memory management issues inherent in the decision to pretend C is like ML). My own introduction was the book by PJ Brown, but it's not a good general introduction (though quite good for interpreters specifically). But back to your question.

The answer is, do as much as you can in the lexer without needing to use forward- or backward-looking constraints.

This means that (depending of course on the details of the language) you should recognise a string as a " character followed by a sequence of not-" and then another " character. Return that to the parser as a single unit. There are several reasons for this, but the important ones are

This reduces the amount of state the parser needs to maintain, limiting its memory consumption.
This allows the lexer implementation to concentrate on recognising the fundamental building blocks and frees the parser up to describe how the individual syntactic elements are used to build a program.

Very often parsers can take immediate actions on receiving a token from the lexer. For example, as soon as IDENTIFIER is received, the parser can perform a symbol table lookup to find out if the symbol is already known. If your parser also parses string constants as QUOTE (IDENTIFIER SPACES)* QUOTE you will perform a lot of irrelevant symbol table lookups, or you will end up hoisting the symbol table lookups higher up the parser's tree of syntax elements, because you can only do it at the point you're now sure you are not looking at a string.

To restate what I'm trying to say, but differently, the lexer should be concerned with the spelling of things, and the parser with the structure of things.

You might notice that my description of what a string looks like seems a lot like a regular expression. This is no coincidence. Lexical analysers are frequently implemented in little languages (in the sense of Jon Bentley's excellent Programming Pearls book) which use regular expressions. I'm just used to thinking in terms of regular expressions when recognising text.

Regarding your question about whitespace, recognise it in the lexer. If your language is intended to be pretty free-format, don't return WHITESPACE tokens to the parser, because it will only have to throw them away, so your parser's production rules will be spammed with noise essentially - things to recognise just to throw them away.

As for what that means about how you should handle whitespace when it is syntactically significant, I'm not sure I can make a judgment for you that will really work well without knowing more about your language. My snap judgment is to avoid cases where whitespace is sometimes important and sometimes not, and use some kind of delimiter (like quotes). But, if you can't design the language any which way you prefer, this option may not be available to you.

There are other ways to do design language parsing systems. Certainly there are compiler construction systems that allow you to specify a combined lexer and parser system (I think the Java version of ANTLR does this) but I have never used one.

Last a historical note. Decades ago, it was important for the lexer to do as much as possible before handing over to the parser, because the two programs would not fit in memory at the same time. Doing more in the lexer left more memory available to make the parser smart. I used to use the Whitesmiths C Compiler for a number of years, and if I understand correctly, it would operate in only 64KB of RAM (it was a small-model MS-DOS program) and even so it translated a variant of C that was very very close to ANSI C.

Parser Combinator vs Parser Generator – When to Use Each

I have done a lot of research these past few days, to understand better why these separate technologies exist, and what their strengths and weaknesses are.

Some of the already-existing answers hinted at some of their differences, but they did not give the complete picture, and seemed to be somewhat opinionated, which is why this answer was written.

This exposition is long, but important. bear with me (Or if you're impatient, scroll to the end to see a flowchart).

To understand the differences between Parser Combinators and Parser Generators, one first needs to understand the difference between the various kinds of parsing that exist.

Parsing

Parsing is the process of analyzing of a string of symbols according to a formal grammar. (In Computing Science,) parsing is used to be able to let a computer understand text written in a language, usually creating a parse tree that represents the written text, storing the meaning of the different written parts in each node of the tree. This parse tree can then be used for a variety of different purposes, such as translating it to another language (used in many compilers), interpreting the written instructions directly in some way (SQL, HTML), allowing tools like Linters to do their work, etc. Sometimes, a parse tree is not explicitly generated, but rather the action that should be performed at each type of node in the tree is executed directly. This increases efficiency, but underwater still an implicit parse tree exists.

Parsing is a problem that is computationally difficult. There has been over fifty years of research on this subject, but there is still much to learn.

Roughly speaking, there are four general algorithms to let a computer parse input:

LL parsing. (Context-free, top-down parsing.)
LR parsing. (Context-free, bottom-up parsing.)
PEG + Packrat parsing.
Earley Parsing.

Note that these types of parsing are very general, theoretical descriptions. There are multiple ways to implement each of these algorithms on physical machines, with different tradeoffs.

LL and LR can only look at Context-Free grammars (that is; the context around the tokens that are written is not important to understand how they are used).

PEG/Packrat parsing and Earley parsing are used a lot less: Earley-parsing is nice in that it can handle a whole lot more grammars (including those that are not necessarily Context-Free) but it is less efficient (as claimed by the dragon book (section 4.1.1); I am not sure if these claims are still accurate). Parsing Expression Grammar + Packrat-parsing is a method that is relatively efficient and can also handle more grammars than both LL and LR, but hides ambiguities, as will quickly be touched on below.

LL (Left-to-right, Leftmost derivation)

This is possibly the most natural way to think about parsing. The idea is to look at the next token in the input string and then decide which one of maybe multiple possible recursive calls should be taken to generate a tree structure.

This tree is built 'top-down', meaning that we start at the root of the tree, and travel the grammar rules in the same way as we travel through the input string. It can also be seen as constructing a 'postfix' equivalent for the 'infix' token stream that is being read.

Parsers performing LL-style parsing can be written to look very much like the original grammar that was specified. This makes it relatively easy to understand, debug and enhance them. Classical Parser Combinators are nothing more than 'lego pieces' that can be put together to build an LL-style parser.

LR (Left-to-right, Rightmost derivation)

LR parsing travels the other way, bottom-up: At each step, the top element(s) on the stack are compared to the list of grammar, to see if they could be reduced to a higher-level rule in the grammar. If not, the next token from the input stream is shift ed and placed on top of the stack.

A program is correct if at the end we end up with a single node on the stack which represents the starting rule from our grammar.

Lookahead

In either of these two systems, it sometimes is necessary to peek at more tokens from the input before being able to decide which choice to make. This is the (0), (1), (k) or (*)-syntax you see after the names of these two general algorithms, such as LR(1) or LL(k). k usually stands for 'as much as your grammar needs', while * usually stands for 'this parser performs backtracking', which is more powerful/easy to implement, but has a much higher memory and time usage than a parser that can just keep on parsing linearly.

Note that LR-style parsers already have many tokens on the stack when they might decide to 'look ahead', so they already have more information to dispatch on. This means that they often need less 'lookahead' than an LL-style parser for the same grammar.

LL vs. LR: Ambiguity

When reading the two descriptions above, one might wonder why LR-style parsing exists, as LL-style parsing seems a lot more natural.

However, LL-style parsing has a problem: Left Recursion.

It is very natural to write a grammar like:

expr ::= expr '+' expr | term
term ::= integer | float

But, a LL-style parser will get stuck in an infinite recursive loop when parsing this grammar: When trying out the left-most possibility of the expr rule, it recurses to this rule again without consuming any input.

There are ways to resolve this problem. The simplest is to rewrite your grammar so that this kind of recursion does not happen any more:

expr ::= term expr_rest
expr_rest ::= '+' expr | ϵ
term ::= integer | float

(Here, ϵ stands for the 'empty string')

This grammar now is right recursive. Note that it immediately is a lot more difficult to read.

In practice, left-recursion might happen indirectly with many other steps in-between. This makes it a hard problem to look out for. But trying to solve it makes your grammar harder to read.

As Section 2.5 of the Dragon Book states:

We appear to have a conflict: on the one hand we need a grammar that facilitates translation, on the other hand we need a significantly different grammar that facilitates parsing. The solution is to begin with the grammar for easy translation and carefully transform it to facilitate parsing. By eliminating the left recursion we can obtain a grammar suitable for use in a predictive recursive-descent translator.

LR-style parsers do not have the problem of this left-recursion, as they build the tree from the bottom-up. However, the mental translation of a grammar like above to an LR-style parser (which is often implemented as a Finite-State Automaton)
is very hard (and error-prone) to do, as often there are hundreds or thousands of states + state transitions to consider. This is why LR-style parsers are usually generated by a Parser Generator, which is also known as a 'compiler compiler'.

How to resolve Ambiguities

We saw two methods to resolve Left-recursion ambiguities above: 1) rewrite the syntax 2) use an LR-parser.

But there are other kinds of ambiguities which are harder to solve: What if two different rules are equally applicable at the same time?

Some common examples are:

arithmetic expressions.
the Dangling Else

Both LL-style and LR-style parsers have problems with these. Problems with parsing arithmetic expressions can be solved by introducing operator precedence. In a similar way, other problems like the Dangling Else can be solved, by picking one precedence behaviour and sticking with it. (In C/C++, for instance, the dangling else always belongs to the closest 'if').

Another 'solution' to this is to use Parser Expression Grammar (PEG): This is similar to the BNF-grammar used above, but in the case of an ambiguity, always 'pick the first'. Of course, this does not really 'solve' the problem, but rather hide that an ambiguity actually exists: The end users might not know which choice the parser makes, and this might lead to unexpected results.

More information that is a whole lot more in-depth than this post, including why it is impossible in general to know if your grammar does not have any ambiguities and the implications of this is the wonderful blog article LL and LR in context: Why parsing tools are hard. I can highly recommend it; it helped me a lot to understand all the things I am talking about right now.

50 years of research

But life goes on. It turned out that 'normal' LR-style parsers implemented as finite state automatons often needed thousands of states + transitions, which was a problem in program size. So, variants such as Simple LR (SLR) and LALR (Look-ahead LR) were written that combine other techniques to make the automaton smaller, reducing the disk and memory footprint of the parser programs.

Also, another way to resolve the ambiguities listed above is to use generalized techniques in which, in the case of an ambiguity, both possibilities are kept and parsed: Either one might fail to parse down the line (in which case the other possibility is the 'correct' one), as well as returning both (and in this way showing that an ambiguity exists) in the case they both are correct.

Interestingly, after the Generalized LR algorithm was described, it turned out that a similar approach could be used to implement Generalized LL parsers, which is similarly fast ( $O(n^3)$ time complexity for ambiguous grammars, $ O(n) $ for completely unambiguous grammars, albeit with more bookkeeping than a simple (LA)LR parser, which means a higher constant-factor) but again allow a parser to be written in recursive descent (top-down) style that is a lot more natural to write and debug.

Parser Combinators, Parser Generators

So, with this long exposition, we are now arriving at the core of the question:

What is the difference of Parser Combinators and Parser Generators, and when should one be used over the other?

They are really different kinds of beasts:

Parser Combinators were created because people were writing top-down parsers and realized that many of these had a lot in common.

Parser Generators were created because people were looking to build parsers that did not have the problems that LL-style parsers had (i.e. LR-style parsers), which proved very hard to do by hand. Common ones include Yacc/Bison, that implement (LA)LR).

Interestingly, nowadays the landscape is muddled somewhat:

It is possible to write Parser Combinators that work with the GLL algorithm, resolving the ambiguity-issues that classical LL-style parsers had, while being just as readable/understandable as all kinds of top-down parsing.
Parser Generators can also be written for LL-style parsers. ANTLR does exactly that, and uses other heuristics (Adaptive LL(*)) to resolve the ambiguities that classical LL-style parsers had.

In general, creating an LR parser generator and and debugging the output of an (LA)LR-style parser generator running on your grammar is difficult, because of the translation of your original grammar to the 'inside-out' LR form. On the other hand, tools like Yacc/Bison have had many years of optimisations, and seen a lot of use in the wild, which means that many people now consider it as the way to do parsing and are sceptical towards new approaches.

Which one you should use, depends on how hard your grammar is, and how fast the parser needs to be. Depending on the grammar, one of these techniques (/implementations of the different techniques) might be faster, have a smaller memory footprint, have a smaller disk footprint, or be more extensible or easier to debug than the others. Your Mileage May Vary.

Side note: On the subject of Lexical Analysis.

Lexical Analysis can be used both for Parser Combinators and Parser Generators. The idea is to have a 'dumb' parser that is very easy to implement (and therefore fast) that performs a first pass over your source code, removing for instance repeating white spaces, comments, etc, and possibly 'tokenizing' in a very coarse way the different elements that make up your language.

The main advantage is that this first step makes the real parser a lot simpler (and because of that possibly faster). The main disadvantage is that you have a separate translation step, and e.g. error reporting with line- and column numbers becomes harder because of the removal of white-space.

A lexer in the end is 'just' another parser and can be implemented using any of the techniques above. Because of its simplicity, often other techniques are used than for the main parser, and for instance extra 'lexer generators' exist.

Tl;Dr:

Here is a flowchart that is applicable to most cases: