Scala – Right Way to Read from Token List to Implement Parse Tree

parsing, scala

I'm trying to tackle the While Language problem on HackerRank using Scala. I've been given a set of grammar rules and have to implement an interpreter for the language they define.

I've decided to approach this by building a lexer and a parser in two separate stages. My lexer works perfectly: it correctly tags tokens and wraps them in a List[Token], where Token is defined as follows:

case class Token(typeof: String, value: Any)

Right now, my problem is purely conceptual: how should I scan through my token list to implement a parse tree?

Note that I don't yet have any implementations for the actual nodes of my parse tree, as I want to solve this problem first (though I'm happy to hear suggestions for those if they're indispensable to solving this).

Here is what I've done so far:

  • I've taken a look at the shunting-yard algorithm, but I'm not sure how to generalise it for anything beyond arithmetic operations. I thought of having multiple stacks designed for different node types, but I'm not convinced it's the right way to do it – this would easily become unmaintainable.

  • I've considered a naive linear traversal where I create an empty working list, insert tokens from my token list into it one by one, and scan the working list after each insertion, trying to match a grammar rule against it (if one matches, I remove the matched tokens from the list; otherwise I leave it be). But I'm hesitant to do this because it seems a) inefficient and b) to defeat the point of a separate lexing phase. I'm looking for something cleaner.

Beyond this, I'm stuck. I'm new to the parsing game and would like to understand what the right approach is.

HackerRank only supports Scala 2.11, which moved the parser-combinator library out of the standard library, so I can't use that. I'm happy to use any functional features that Scala offers, but I don't want to rely on something that only functional languages implement.


For the record, here is my Lexer code:

import scala.collection.mutable.HashMap

class Lexer {
    // keyword and operator tables used to classify raw tokens
    val parenthesis_mapping = HashMap("(" -> ")", "{" -> "}")
    val keywords = List("if", "then", "else", "while", "do", "true", "false")
    val boolean_operators = List("and", "or")
    val arithmetic_operators = List("+", "-", "/", "*", ">", "<")
    val variables = """([a-z]+)""".r
    val numbers = """([0-9]+)""".r // the grammar only has integers, so no decimal point

    // converts a raw string token to the corresponding Token type
    def convert(input: String): Token =
        if (keywords.contains(input)) Token("KEYWORD", input)
        else if (boolean_operators.contains(input)) Token("BOOLEAN", input)
        else if (arithmetic_operators.contains(input)) Token("ARITHMETIC", input)
        else if (parenthesis_mapping.contains(input)) Token("OPENING", input)
        else if (parenthesis_mapping.values.exists(_ == input)) Token("CLOSING", input)
        else input match {
            case variables(output) => Token("VARIABLE", output)
            case ":=" => Token("ASSIGNMENT", null)
            case numbers(output) => Token("NUMBER", output.toInt)
            case ";" => Token("BREAK", null)
            case _ => Token("NULL", input)
        }

    // splits the input string and applies `convert` to each raw token
    def lex(strings: String): List[Token] = {...} // only providing signature here
}
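
To illustrate, `convert` classifies raw tokens like this:

val lexer = new Lexer()
lexer.convert("while") // Token("KEYWORD", "while")
lexer.convert(":=")    // Token("ASSIGNMENT", null)
lexer.convert("42")    // Token("NUMBER", 42)
lexer.convert("cur")   // Token("VARIABLE", "cur")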

And here is the grammar I have to implement:

Below is the description of the grammar that we will use:

  • x, y ∈ Var (variables)

  • n ∈ Num (numerals/integers)

  • op_{a} ∈ Opa (arithmetic operators)
    op_{a} ::= + | - | * | /

  • op_{b} ∈ Opb (boolean operators)
    op_{b} ::= and | or

  • op_{r} ∈ Opr (relational operators)
    op_{r} ::= > | <

  • a ∈ AExp (arithmetic expressions)
    a ::= x | n | a1 op_{a} a2 | ( a )

  • b ∈ BExp (boolean expressions)
    b ::= true | false | b1 op_{b} b2 | a1 op_{r} a2 | ( b )

  • S ∈ Stmt (statements)
    S ::= x := a | S1 ; S2 | if b then { S1 } else { S2 } | while b do { S }

Here all operators are left associative. Their precedence order is as
follows.

  1. Arithmetic Operators: (*, /) > (+, -) > (>, <)
  2. Boolean Operators: and > or.

You can safely assume that all variables have integer type and are initialized properly. All variable names will consist of only lowercase letters ('a'-'z'), and their length will not exceed 10.

Note that ";" is more like of a sequencing operator. It is used to
concatenate two statements. That's why there will be no ";" at the end
of block of statements.

An example of the resulting syntax:

fact := 1 ;
val := 10000 ;
cur := val ;
mod := 1000000007 ;

while ( cur > 1 )
  do
   {
      fact := fact * cur ;
      fact := fact - fact / mod * mod ;
      cur := cur - 1
   } ;

cur := 0

Best Answer

I would suggest treating your Lexer as more of a generator, rather than identifying all the tokens at once and sending an entire list to the parser. Your lexer could simply expose an interface with a .NextToken() or .Next() (and probably .Backup()) method, and any time the Parser requires a new token, it can call one of those methods.

If your language requires lookahead tokens, you could simply keep a buffer of lookahead tokens in the Parser and always lag N tokens behind the Lexer's current token.
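
Since your lexer already produces a List[Token], a minimal Scala sketch of such a stream could simply wrap that list (the class and method names here are illustrative, not a fixed API):

class TokenStream(tokens: List[Token]) {
    private var rest = tokens

    // look at the next token without consuming it
    def peek: Option[Token] = rest.headOption

    // look ahead n tokens without consuming any (n = 0 is the next token)
    def lookahead(n: Int): Option[Token] = rest.drop(n).headOption

    // consume and return the next token, if any
    def next(): Option[Token] = rest match {
        case head :: tail => rest = tail; Some(head)
        case Nil => None
    }
}

Backed by an immutable list, arbitrary lookahead is just an index into the remaining tokens, so a separate lag buffer isn't strictly necessary here.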

As far as actual parsing techniques are concerned, you should probably consider recursive descent parsing. In my opinion, recursive-descent parsers are typically the easiest to code by hand and conceptually the simplest to understand.

Recursive descent parsing typically works by mapping each production in your grammar (each nonterminal) to a function or method. So if you have a grammar production for a variable declaration like var x := 1, your method could look something like this:

// Grammar Production:
//   'var' Identifier ':=' Expression
private Declaration ParseDeclaration() {
    Declaration decl = new Declaration();

    if (!Expect(TokenType.Var)) {
        throw new ParseError("Expected 'var' keyword");
    }

    decl.Identifier = ParseIdentifier();

    if (!Expect(TokenType.AssignmentEquals)) {
        throw new ParseError("Expected token ':='");
    }

    decl.Expression = ParseExpression();
    return decl;
}

For each production in your grammar, you can define a method like that one and recursively construct your parse tree. It ends up being a really elegant system to code, rather than trying to match productions against token patterns. You can use the power of your programming language's call stack to control the flow and to handle repetition and parse errors. Shunting yard is also challenging to generalize to non-expression grammar rules, so I don't recommend going down that path.

P.S. I'm sorry, I don't know Scala syntax offhand, so I opted to just use a Java/C# type of syntax.
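
For completeness, a rough Scala sketch of the same recursive-descent idea, applied to the arithmetic expressions from the question's grammar, with one method per precedence level (the AST node types Num, Var, and BinOp are illustrative, not from the original post):

// illustrative AST nodes for arithmetic expressions
sealed trait AExp
case class Num(n: Int) extends AExp
case class Var(name: String) extends AExp
case class BinOp(op: String, left: AExp, right: AExp) extends AExp

class Parser(tokens: List[Token]) {
    private var rest = tokens

    private def peek: Option[Token] = rest.headOption
    private def advance(): Token = { val t = rest.head; rest = rest.tail; t }
    private def nextOpIn(ops: Set[String]): Boolean =
        peek.exists(t => t.typeof == "ARITHMETIC" && ops.contains(t.value.toString))

    // a ::= term (('+' | '-') term)*  -- lower precedence than * and /
    def parseAExp(): AExp = {
        var left = parseTerm()
        while (nextOpIn(Set("+", "-"))) {
            val op = advance().value.toString
            left = BinOp(op, left, parseTerm()) // folding to the left keeps left associativity
        }
        left
    }

    // term ::= atom (('*' | '/') atom)*  -- binds tightest
    private def parseTerm(): AExp = {
        var left = parseAtom()
        while (nextOpIn(Set("*", "/"))) {
            val op = advance().value.toString
            left = BinOp(op, left, parseAtom())
        }
        left
    }

    // atom ::= n | x | '(' a ')'
    private def parseAtom(): AExp = advance() match {
        case Token("NUMBER", n: Int) => Num(n)
        case Token("VARIABLE", name) => Var(name.toString)
        case Token("OPENING", "(") =>
            val inner = parseAExp()
            advance() // consume the matching ')'
            inner
        case t => throw new RuntimeException("unexpected token: " + t)
    }
}

The relational operators > and < bind loosest, so they would get one more layer that calls parseAExp on each side; and and or in BExp follow the same pattern, and each statement form (:=, ;, if, while) gets its own method selected by the leading token.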
