Language Design – Parser Generator vs Custom Lexer and Parser

compilerlanguage-designparsing

What specific advantages and disadvantages of each way to working on a programming language grammar?

Why/When should I roll my own? Why/When should I use a generator?

Best Answer

There are three options really, all three of them preferable in different situations.

Option 1: parser generators, or 'you need to parse some language and you just want to get it working, dammit'

Say, you're asked to build a parser for some ancient data format NOW. Or you need your parser to be fast. Or you need your parser to be easily maintainable.

In these cases, you're probably best off using a parser generator. You don't have to fiddle around with the details, you don't have to get lots of complicated code to work properly, you just write out the grammar the input will adhere to, write some handling code and presto: instant parser.

The advantages are clear:

  • It's (usually) quite easy to write a specification, in particular if the input format isn't too weird (option 2 would be better if it is).
  • You end up with a very easily maintainable piece of work that is easily understood: a grammar definition usually flows a lot more natural than code.
  • The parsers generated by good parser generators are usually a lot faster than hand-written code. Hand-written code can be faster, but only if you know your stuff - this is why most widely used compilers use a hand-written recursive-descent parser.

There's one thing you have to be careful of with parser-generators: the can sometimes reject your grammars. For an overview of the different types of parsers and how they can bite you, you may want to start here. Here you can find an overview of a lot of implementations and the types of grammars they accept.

Option 2: hand-written parsers, or 'you want to build your own parser, and you care about being user-friendly'

Parser generators are nice, but they aren't very user (the end-user, not you) friendly. You typically can't give good error messages, nor can you provide error recovery. Perhaps your language is very weird and parsers reject your grammar or you need more control than the generator gives you.

In these cases, using a hand-written recursive-descent parser is probably the best. While getting it right may be complicated, you have complete control over your parser so you can do all kinds of nice stuff you can't do with parser generators, like error messages and even error recovery (try removing all the semicolons from a C# file: the C# compiler will complain, but will detect most other errors anyway regardless of the presence of semicolons).

Hand-written parsers also usually perform better than generated ones, assuming the quality of the parser is high enough. On the other hand, if you don't manage to write a good parser - usually due to (a combination of) lack of experience, knowledge or design - then performance is usually slower. For lexers the opposite is true though: generally generated lexers use table lookups, making them faster than (most) hand-written ones.

Education-wise, writing your own parser will teach you more than using a generator. You have to write more and more complicated code after all, plus you have to understand exactly how you parse a language. On the other hand, if you want to learn how to create your own language (so, get experience at language design), either option 1 or option 3 is preferable: if you're developing a language, it will probably change a lot, and option 1 and 3 give you an easier time with that.

Option 3: hand written parser generators, or 'you're trying to learn a lot from this project and you wouldn't mind ending up with a nifty piece of code you can re-use a lot'

This is the path I'm currently walking down: you write your own parser generator. While highly nontrivial, doing this will probably teach you the most.

To give you an idea what doing a project like this involves I'll tell you about my own progress.

The lexer generator

I created my own lexer generator first. I usually design software starting with how the code will be used, so I thought about how I wanted to be able to use my code and wrote this piece of code (it's in C#):

Lexer<CalculatorToken> calculatorLexer = new Lexer<CalculatorToken>(
    new List<StringTokenPair>()
    { // This is just like a lex specification:
      //                    regex   token
        new StringTokenPair("\\+",  CalculatorToken.Plus),
        new StringTokenPair("\\*",  CalculatorToken.Times),
        new StringTokenPair("(",    CalculatorToken.LeftParenthesis),
        new StringTokenPair(")",    CalculatorToken.RightParenthesis),
        new StringTokenPair("\\d+", CalculatorToken.Number),
    });

foreach (CalculatorToken token in
             calculatorLexer.GetLexer(new StringReader("15+4*10")))
{ // This will iterate over all tokens in the string.
    Console.WriteLine(token.Value);
}

// Prints:
// 15
// +
// 4
// *
// 10

The input string-token pairs are converted into a corresponding recursive structure describing the regular expressions they represent using the ideas of an arithmetic stack. This is then converted into a NFA (nondeterministic finite automaton), which is in turn converted into a DFA (deterministic finite automaton). You can then match strings against the DFA.

This way, you get a good idea how exactly lexers work. In addition, if you do it the right way the results from your lexer generator can be roughly as fast as professional implementations. You also don't lose any expressiveness compared to option 2, and not much expressiveness compared to option 1.

I implemented my lexer generator in just over 1600 lines of code. This code makes the above work, but it still generates the lexer on the fly every time you start the program: I'm going to add code to write it to disk at some point.

If you want to know how to write your own lexer, this is a good place to start.

The parser generator

You then write your parser generator. I refer to here again for an overview on the different kinds of parsers - as a rule of thumb, the more they can parse, the slower they are.

Speed not being an issue for me, I chose to implement an Earley parser. Advanced implementations of an Earley parser have been shown to be about twice as slow as other parser types.

In return for that speed hit, you get the ability to parse any kind of grammar, even ambiguous ones. This means you never need to worry about whether your parser has any left-recursion in it, or what a shift-reduce conflict is. You can also define grammars more easily using ambiguous grammars if it doesn't matter which parse tree is the result, such as that it doesn't matter whether you parse 1+2+3 as (1+2)+3 or as 1+(2+3).

This is what a piece of code using my parser generator can look like:

Lexer<CalculatorToken> calculatorLexer = new Lexer<CalculatorToken>(
    new List<StringTokenPair>()
    {
        new StringTokenPair("\\+",  CalculatorToken.Plus),
        new StringTokenPair("\\*",  CalculatorToken.Times),
        new StringTokenPair("(",    CalculatorToken.LeftParenthesis),
        new StringTokenPair(")",    CalculatorToken.RightParenthesis),
        new StringTokenPair("\\d+", CalculatorToken.Number),
    });

Grammar<IntWrapper, CalculatorToken> calculator
    = new Grammar<IntWrapper, CalculatorToken>(calculatorLexer);

// Declaring the nonterminals.
INonTerminal<IntWrapper> expr = calculator.AddNonTerminal<IntWrapper>();
INonTerminal<IntWrapper> term = calculator.AddNonTerminal<IntWrapper>();
INonTerminal<IntWrapper> factor = calculator.AddNonTerminal<IntWrapper>();

// expr will be our head nonterminal.
calculator.SetAsMainNonTerminal(expr);

// expr: term | expr Plus term;
calculator.AddProduction(expr, term.GetDefault());
calculator.AddProduction(expr,
                         expr.GetDefault(),
                         CalculatorToken.Plus.GetDefault(),
                         term.AddCode(
                         (x, r) => { x.Result.Value += r.Value; return x; }
                         ));

// term: factor | term Times factor;
calculator.AddProduction(term, factor.GetDefault());
calculator.AddProduction(term,
                         term.GetDefault(),
                         CalculatorToken.Times.GetDefault(),
                         factor.AddCode
                         (
                         (x, r) => { x.Result.Value *= r.Value; return x; }
                         ));

// factor: LeftParenthesis expr RightParenthesis
//         | Number;
calculator.AddProduction(factor,
                         CalculatorToken.LeftParenthesis.GetDefault(),
                         expr.GetDefault(),
                         CalculatorToken.RightParenthesis.GetDefault());
calculator.AddProduction(factor,
                         CalculatorToken.Number.AddCode
                         (
                         (x, s) => { x.Result = new IntWrapper(int.Parse(s));
                                     return x; }
                         ));

IntWrapper result = calculator.Parse("15+4*10");
// result == 55

(Note that IntWrapper is simply an Int32, except that C# requires it to be a class, hence I had to introduce a wrapper class)

I hope you see that the code above is very powerful: any grammar you can come up with can be parsed. You can add arbitrary bits of code in the grammar capable of performing lots of tasks. If you manage to get this all working, you can re-use the resulting code to do a lot of tasks very easily: just imagine building a command-line interpreter using this piece of code.

Related Topic