Compiler Lexer – Why Implement as a 2D Array and Giant Switch?

compiler, lexer, pragmatism

I'm slowly working to finish my degree, and this semester is Compilers 101. We're using the Dragon Book. A short way into the course, we're talking about lexical analysis and how it can be implemented via deterministic finite automata (hereafter, DFA): set up your various lexer states, define the transitions between them, and so on.

But both the professor and the book propose implementing the DFA via transition tables, which amount to a giant 2D array (the various non-terminal states as one dimension, the possible input symbols as the other), plus a switch statement that handles all of the terminal states and dispatches through the transition table when in a non-terminal state.

The theory is all well and good, but as someone who's actually written code for decades, the implementation is vile. It's not testable, it's not maintainable, it's not readable, and it's a pain and a half to debug through. Worse yet, I can't see how it would be remotely practical if the language were Unicode-capable. Having a million or so transition-table entries per non-terminal state gets unwieldy in a hurry.

So what's the deal? Why is the definitive book on the subject saying to do it this way?

Is the overhead of function calls really that significant? Is this something that works well, or is even necessary, when the grammar isn't known ahead of time (regular expressions?)? Or is it perhaps something that handles all cases, even though more specific solutions work better for more specific grammars?

(Note: the possible duplicate "Why use an OO approach instead of a giant switch statement?" is close, but I don't care specifically about OO. A functional approach, or even a saner imperative approach with standalone functions, would be fine.)

And for the sake of example, consider a language that only has identifiers, and those identifiers are [a-zA-Z]+. In the DFA implementation, you'd get something like:

private enum State
{
    Error = -1,
    Start = 0,
    IdentifierInProgress = 1,
    IdentifierDone = 2
}

private static State[][] transition = new State[][]{
    ///* Start */                  new State[]{ State.Error, State.Error (repeat until 'A'), State.IdentifierInProgress, ...
    ///* IdentifierInProgress */   new State[]{ State.IdentifierDone, State.IdentifierDone (repeat until 'A'), State.IdentifierInProgress, ...
    ///* etc. */
};

public static string NextToken(string input, int startIndex)
{
    State currentState = State.Start;
    int currentIndex = startIndex;
    while (currentIndex < input.Length)
    {
        switch (currentState)
        {
            case State.Error:
                // Whatever, example
                throw new NotImplementedException();
            case State.IdentifierDone:
                // currentIndex has already advanced past the character that
                // ended the identifier, hence the extra -1
                return input.Substring(startIndex, currentIndex - startIndex - 1);
            default:
                currentState = transition[(int)currentState][input[currentIndex]];
                currentIndex++;
                break;
        }
    }

    return String.Empty;
}

(though a real version would also need to handle the end of the input correctly; a sketch of one way to do that follows)
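For completeness, here's a sketch (my own illustration, not from the book) of how that same loop might handle running out of input, assuming the same State enum and transition table as above; an identifier still in progress when the input ends is returned as a token, and only a genuinely empty scan yields an empty string. The name is just to keep it distinct from the version above.

public static string NextTokenHandlingEndOfInput(string input, int startIndex)
{
    State currentState = State.Start;
    int currentIndex = startIndex;
    while (currentIndex < input.Length)
    {
        switch (currentState)
        {
            case State.Error:
                throw new NotImplementedException();
            case State.IdentifierDone:
                // currentIndex has already advanced past the character that
                // ended the identifier, hence the extra -1
                return input.Substring(startIndex, currentIndex - startIndex - 1);
            default:
                currentState = transition[(int)currentState][input[currentIndex]];
                currentIndex++;
                break;
        }
    }

    // The input ran out: an identifier that was still in progress is a valid token.
    if (currentState == State.IdentifierInProgress)
    {
        return input.Substring(startIndex, currentIndex - startIndex);
    }

    return String.Empty;
}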

Compared to what I would expect:

public static string NextToken(string input, int startIndex)
{
    int currentIndex = startIndex;
    while (currentIndex < input.Length && IsLetter(input[currentIndex]))
    {
        currentIndex++;
    }

    return input.Substring(startIndex, currentIndex - startIndex);
}

public static bool IsLetter(char c)
{
    return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
}

With the body of NextToken refactored out into its own function once there is more than one possible destination from the start state of the DFA (see the sketch below).
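To illustrate that refactoring, here's a rough sketch of how the hand-written style might grow once the start state has more than one successor; the grammar (identifiers plus integer literals) and the helper names are hypothetical, just to show the shape of the dispatch:

public static string NextToken(string input, int startIndex)
{
    char first = input[startIndex];

    // Dispatch on the first character; each helper plays the role of
    // one branch out of the DFA's start state.
    if (IsLetter(first))
    {
        return LexIdentifier(input, startIndex);
    }
    if (first >= '0' && first <= '9')
    {
        return LexNumber(input, startIndex);
    }

    throw new ArgumentException("Unexpected character: " + first);
}

private static string LexIdentifier(string input, int startIndex)
{
    int currentIndex = startIndex;
    while (currentIndex < input.Length && IsLetter(input[currentIndex]))
    {
        currentIndex++;
    }
    return input.Substring(startIndex, currentIndex - startIndex);
}

private static string LexNumber(string input, int startIndex)
{
    int currentIndex = startIndex;
    while (currentIndex < input.Length && input[currentIndex] >= '0' && input[currentIndex] <= '9')
    {
        currentIndex++;
    }
    return input.Substring(startIndex, currentIndex - startIndex);
}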

Best Answer

The motivation for the particular algorithm is largely that it's a learning exercise, so it tries to stay close to the idea of a DFA and keep states and transitions very explicit in the code. As a rule, nobody would actually write any of this code by hand anyway; you would use a tool to generate it from a grammar. And that tool doesn't care about the readability of the code, because it isn't source code to be maintained; it's output produced from the definition of the grammar.
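To make the "generated code" point concrete, here's a rough sketch of the shape such generated scanners often take, for the same [a-zA-Z]+ identifier language; the names and tables are my own illustration, not the output of any real tool. Note the extra indirection through character classes: generators such as lex/flex map each character to a small equivalence class so the table is states-by-classes rather than states-by-raw-characters, which is also what keeps the approach tractable for large alphabets like Unicode.

// Character classes: a generator computes these from the grammar; here
// there are only two, "letter" and "everything else".
private const int ClassOther = 0;
private const int ClassLetter = 1;

private static int CharClass(char c)
{
    return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) ? ClassLetter : ClassOther;
}

// transitions[state, class]; -1 means "no transition from here".
// State 0 = start, state 1 = inside an identifier.
private static readonly int[,] transitions =
{
    /* start      */ { -1, 1 },
    /* identifier */ { -1, 1 },
};

// Which states accept a completed token.
private static readonly bool[] accepting = { false, true };

public static string NextToken(string input, int startIndex)
{
    int state = 0;
    int index = startIndex;
    int lastAccept = -1; // end of the longest accepted token seen so far

    while (index < input.Length)
    {
        int next = transitions[state, CharClass(input[index])];
        if (next < 0)
        {
            break; // no transition: the token (if any) ended here
        }
        state = next;
        index++;
        if (accepting[state])
        {
            lastAccept = index;
        }
    }

    return lastAccept < 0 ? String.Empty : input.Substring(startIndex, lastAccept - startIndex);
}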

Your code is cleaner for someone maintaining a hand-written DFA, but a little farther removed from the concepts being taught.
