Compiler Lexer – Why Implement as a 2D Array and Giant Switch?

compiler, lexer, pragmatism

I'm slowly working to finish my degree, and this semester is Compilers 101. We're using the Dragon Book. A short way into the course, we're talking about lexical analysis and how it can be implemented via deterministic finite automata (hereafter, DFA): set up your various lexer states, define the transitions between them, and so on.

But both the professor and the book propose implementing the DFA via transition tables, which amount to a giant 2D array (the various non-terminal states as one dimension, the possible input symbols as the other), plus a switch statement that handles all of the terminal states and dispatches through the transition table when in a non-terminal state.

The theory is all well and good, but as someone who's actually written code for decades, the implementation is vile. It's not testable, it's not maintainable, it's not readable, and it's a pain and a half to debug through. Worse yet, I can't see how it would be remotely practical if the language were Unicode-capable. Having a million or so transition-table entries per non-terminal state gets unwieldy in a hurry.

So what's the deal? Why is the definitive book on the subject saying to do it this way?

Is the overhead of function calls really that significant? Is this something that works well, or is even necessary, when the grammar isn't known ahead of time (regular expressions?)? Or is it perhaps something that handles all cases, even though more specific solutions work better for more specific grammars?

(Note: the possible duplicate "Why use an OO approach instead of a giant switch statement?" is close, but I don't care specifically about OO. A functional approach, or even a saner imperative approach with standalone functions, would be fine.)

And for the sake of example, consider a language that only has identifiers, and those identifiers are [a-zA-Z]+. In the DFA implementation, you'd get something like:

private enum State
{
    Error = -1,
    Start = 0,
    IdentifierInProgress = 1,
    IdentifierDone = 2
}

private static State[][] transition = new State[][]{
    ///* Start */                  new State[]{ State.Error, State.Error (repeat until 'A'), State.IdentifierInProgress, ...
    ///* IdentifierInProgress */   new State[]{ State.IdentifierDone, State.IdentifierDone (repeat until 'A'), State.IdentifierInProgress, ...
    ///* etc. */
};

public static string NextToken(string input, int startIndex)
{
    State currentState = State.Start;
    int currentIndex = startIndex;
    while (currentIndex < input.Length)
    {
        switch (currentState)
        {
            case State.Error:
                // Whatever, example
                throw new NotImplementedException();
            case State.IdentifierDone:
                // currentIndex has already advanced past the character that
                // ended the identifier, hence the extra -1
                return input.Substring(startIndex, currentIndex - startIndex - 1);
            default:
                currentState = transition[(int)currentState][input[currentIndex]];
                currentIndex++;
                break;
        }
    }

    return String.Empty;
}

(though a real version would also need to handle the end of the input correctly; a sketch of one way to do that follows)
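For completeness, here's a sketch (my own illustration, not from the book) of how that same loop might handle running out of input, assuming the same State enum and transition table as above; an identifier still in progress when the input ends is returned as a token, and only a genuinely empty scan yields an empty string. The name is just to keep it distinct from the version above.

public static string NextTokenHandlingEndOfInput(string input, int startIndex)
{
    State currentState = State.Start;
    int currentIndex = startIndex;
    while (currentIndex < input.Length)
    {
        switch (currentState)
        {
            case State.Error:
                throw new NotImplementedException();
            case State.IdentifierDone:
                // currentIndex has already advanced past the character that
                // ended the identifier, hence the extra -1
                return input.Substring(startIndex, currentIndex - startIndex - 1);
            default:
                currentState = transition[(int)currentState][input[currentIndex]];
                currentIndex++;
                break;
        }
    }

    // The input ran out: an identifier that was still in progress is a valid token.
    if (currentState == State.IdentifierInProgress)
    {
        return input.Substring(startIndex, currentIndex - startIndex);
    }

    return String.Empty;
}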

Compared to what I would expect:

public static string NextToken(string input, int startIndex)
{
    int currentIndex = startIndex;
    while (currentIndex < input.Length && IsLetter(input[currentIndex]))
    {
        currentIndex++;
    }

    return input.Substring(startIndex, currentIndex - startIndex);
}

public static bool IsLetter(char c)
{
    return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
}

With the body of NextToken refactored out into its own function once there is more than one possible destination from the start state of the DFA (see the sketch below).
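To illustrate that refactoring, here's a rough sketch of how the hand-written style might grow once the start state has more than one successor; the grammar (identifiers plus integer literals) and the helper names are hypothetical, just to show the shape of the dispatch:

public static string NextToken(string input, int startIndex)
{
    char first = input[startIndex];

    // Dispatch on the first character; each helper plays the role of
    // one branch out of the DFA's start state.
    if (IsLetter(first))
    {
        return LexIdentifier(input, startIndex);
    }
    if (first >= '0' && first <= '9')
    {
        return LexNumber(input, startIndex);
    }

    throw new ArgumentException("Unexpected character: " + first);
}

private static string LexIdentifier(string input, int startIndex)
{
    int currentIndex = startIndex;
    while (currentIndex < input.Length && IsLetter(input[currentIndex]))
    {
        currentIndex++;
    }
    return input.Substring(startIndex, currentIndex - startIndex);
}

private static string LexNumber(string input, int startIndex)
{
    int currentIndex = startIndex;
    while (currentIndex < input.Length && input[currentIndex] >= '0' && input[currentIndex] <= '9')
    {
        currentIndex++;
    }
    return input.Substring(startIndex, currentIndex - startIndex);
}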

Best Answer

The motivation for the particular algorithm is largely that it's a learning exercise, so it tries to stay close to the idea of a DFA and keep states and transitions very explicit in the code. As a rule, nobody would actually write any of this code by hand anyway; you would use a tool to generate it from a grammar. And that tool doesn't care about the readability of the code, because it isn't source code to be maintained; it's output produced from the definition of the grammar.
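To make the "generated code" point concrete, here's a rough sketch of the shape such generated scanners often take, for the same [a-zA-Z]+ identifier language; the names and tables are my own illustration, not the output of any real tool. Note the extra indirection through character classes: generators such as lex/flex map each character to a small equivalence class so the table is states-by-classes rather than states-by-raw-characters, which is also what keeps the approach tractable for large alphabets like Unicode.

// Character classes: a generator computes these from the grammar; here
// there are only two, "letter" and "everything else".
private const int ClassOther = 0;
private const int ClassLetter = 1;

private static int CharClass(char c)
{
    return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) ? ClassLetter : ClassOther;
}

// transitions[state, class]; -1 means "no transition from here".
// State 0 = start, state 1 = inside an identifier.
private static readonly int[,] transitions =
{
    /* start      */ { -1, 1 },
    /* identifier */ { -1, 1 },
};

// Which states accept a completed token.
private static readonly bool[] accepting = { false, true };

public static string NextToken(string input, int startIndex)
{
    int state = 0;
    int index = startIndex;
    int lastAccept = -1; // end of the longest accepted token seen so far

    while (index < input.Length)
    {
        int next = transitions[state, CharClass(input[index])];
        if (next < 0)
        {
            break; // no transition: the token (if any) ended here
        }
        state = next;
        index++;
        if (accepting[state])
        {
            lastAccept = index;
        }
    }

    return lastAccept < 0 ? String.Empty : input.Substring(startIndex, lastAccept - startIndex);
}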

Your code is cleaner for someone maintaining a hand-written DFA, but a little farther removed from the concepts being taught.
