Java – Do modern languages still use parser generators

haskelljavaparsingrubyswift-language

I was researching about the gcc compiler suite on wikipedia here, when this came up:

GCC started out using LALR parsers generated with Bison, but gradually switched to hand-written recursive-descent parsers; for C++ in 2004, and for C and Objective-C in 2006. Currently all front ends use hand-written recursive-descent parsers

So by that last sentence, (and for as much as I trust wikipedia) I can definitely say that
"C (gcc), C++ (g++), Objective-C, Objective-C++, Fortran (gfortran), Java (gcj), Ada (GNAT), Go (gccgo), Pascal (gpc),… Mercury, Modula-2, Modula-3, PL/I, D (gdc), and VHDL (ghdl)" are all front-ends that no longer use a parser generator. That is, they all use hand-written parsers.

My question then is, is this practice ubiquitous? Specifically, I'm looking for exact answers to "does the standard/official implementation of x have a hand-written parser" for x in [Python, Swift, Ruby, Java, Scala, ML, Haskell]? (Actually, information on any other languages is also welcome here.) I'm sure I can find this on my own after a lot of digging. But I'm also sure this is easily answerable by the community. Thanks!

Best Answer

AFAIK, GCC use hand-written parsers in particular to improve syntactic error diagnostics (i.e. giving human meaningful messages on syntax errors).

Parsing theory (and the parsing generators descending from it) is mostly about recognizing and parsing a correct input phrase. But we expect from compilers that they give a meaningful error message (and that they are able to parse meaningfully the rest of the input after the syntactic error), for some incorrect input.

Also, old legacy languages -like C11 or C++11- (which are conceptually old, even if their latest revision is only three years old) are not at all context-free. Dealing with that context sensitiveness in grammars for parser generators (i.e. bison or even menhir) is boringly difficult.

Option 1: parser generators, or 'you need to parse some language and you just want to get it working, dammit'

Say, you're asked to build a parser for some ancient data format NOW. Or you need your parser to be fast. Or you need your parser to be easily maintainable.

In these cases, you're probably best off using a parser generator. You don't have to fiddle around with the details, you don't have to get lots of complicated code to work properly, you just write out the grammar the input will adhere to, write some handling code and presto: instant parser.

The advantages are clear:

It's (usually) quite easy to write a specification, in particular if the input format isn't too weird (option 2 would be better if it is).
You end up with a very easily maintainable piece of work that is easily understood: a grammar definition usually flows a lot more natural than code.
The parsers generated by good parser generators are usually a lot faster than hand-written code. Hand-written code can be faster, but only if you know your stuff - this is why most widely used compilers use a hand-written recursive-descent parser.

There's one thing you have to be careful of with parser-generators: the can sometimes reject your grammars. For an overview of the different types of parsers and how they can bite you, you may want to start here. Here you can find an overview of a lot of implementations and the types of grammars they accept.

Option 2: hand-written parsers, or 'you want to build your own parser, and you care about being user-friendly'

Parser generators are nice, but they aren't very user (the end-user, not you) friendly. You typically can't give good error messages, nor can you provide error recovery. Perhaps your language is very weird and parsers reject your grammar or you need more control than the generator gives you.

In these cases, using a hand-written recursive-descent parser is probably the best. While getting it right may be complicated, you have complete control over your parser so you can do all kinds of nice stuff you can't do with parser generators, like error messages and even error recovery (try removing all the semicolons from a C# file: the C# compiler will complain, but will detect most other errors anyway regardless of the presence of semicolons).

Hand-written parsers also usually perform better than generated ones, assuming the quality of the parser is high enough. On the other hand, if you don't manage to write a good parser - usually due to (a combination of) lack of experience, knowledge or design - then performance is usually slower. For lexers the opposite is true though: generally generated lexers use table lookups, making them faster than (most) hand-written ones.

Education-wise, writing your own parser will teach you more than using a generator. You have to write more and more complicated code after all, plus you have to understand exactly how you parse a language. On the other hand, if you want to learn how to create your own language (so, get experience at language design), either option 1 or option 3 is preferable: if you're developing a language, it will probably change a lot, and option 1 and 3 give you an easier time with that.

Option 3: hand written parser generators, or 'you're trying to learn a lot from this project and you wouldn't mind ending up with a nifty piece of code you can re-use a lot'

This is the path I'm currently walking down: you write your own parser generator. While highly nontrivial, doing this will probably teach you the most.

To give you an idea what doing a project like this involves I'll tell you about my own progress.

The lexer generator

I created my own lexer generator first. I usually design software starting with how the code will be used, so I thought about how I wanted to be able to use my code and wrote this piece of code (it's in C#):

Lexer<CalculatorToken> calculatorLexer = new Lexer<CalculatorToken>(
    new List<StringTokenPair>()
    { // This is just like a lex specification:
      //                    regex   token
        new StringTokenPair("\\+",  CalculatorToken.Plus),
        new StringTokenPair("\\*",  CalculatorToken.Times),
        new StringTokenPair("(",    CalculatorToken.LeftParenthesis),
        new StringTokenPair(")",    CalculatorToken.RightParenthesis),
        new StringTokenPair("\\d+", CalculatorToken.Number),
    });

foreach (CalculatorToken token in
             calculatorLexer.GetLexer(new StringReader("15+4*10")))
{ // This will iterate over all tokens in the string.
    Console.WriteLine(token.Value);
}

// Prints:
// 15
// +
// 4
// *
// 10

The input string-token pairs are converted into a corresponding recursive structure describing the regular expressions they represent using the ideas of an arithmetic stack. This is then converted into a NFA (nondeterministic finite automaton), which is in turn converted into a DFA (deterministic finite automaton). You can then match strings against the DFA.

This way, you get a good idea how exactly lexers work. In addition, if you do it the right way the results from your lexer generator can be roughly as fast as professional implementations. You also don't lose any expressiveness compared to option 2, and not much expressiveness compared to option 1.

I implemented my lexer generator in just over 1600 lines of code. This code makes the above work, but it still generates the lexer on the fly every time you start the program: I'm going to add code to write it to disk at some point.

If you want to know how to write your own lexer, this is a good place to start.

The parser generator

You then write your parser generator. I refer to here again for an overview on the different kinds of parsers - as a rule of thumb, the more they can parse, the slower they are.

Speed not being an issue for me, I chose to implement an Earley parser. Advanced implementations of an Earley parser have been shown to be about twice as slow as other parser types.

In return for that speed hit, you get the ability to parse any kind of grammar, even ambiguous ones. This means you never need to worry about whether your parser has any left-recursion in it, or what a shift-reduce conflict is. You can also define grammars more easily using ambiguous grammars if it doesn't matter which parse tree is the result, such as that it doesn't matter whether you parse 1+2+3 as (1+2)+3 or as 1+(2+3).

This is what a piece of code using my parser generator can look like:

Lexer<CalculatorToken> calculatorLexer = new Lexer<CalculatorToken>(
    new List<StringTokenPair>()
    {
        new StringTokenPair("\\+",  CalculatorToken.Plus),
        new StringTokenPair("\\*",  CalculatorToken.Times),
        new StringTokenPair("(",    CalculatorToken.LeftParenthesis),
        new StringTokenPair(")",    CalculatorToken.RightParenthesis),
        new StringTokenPair("\\d+", CalculatorToken.Number),
    });

Grammar<IntWrapper, CalculatorToken> calculator
    = new Grammar<IntWrapper, CalculatorToken>(calculatorLexer);

// Declaring the nonterminals.
INonTerminal<IntWrapper> expr = calculator.AddNonTerminal<IntWrapper>();
INonTerminal<IntWrapper> term = calculator.AddNonTerminal<IntWrapper>();
INonTerminal<IntWrapper> factor = calculator.AddNonTerminal<IntWrapper>();

// expr will be our head nonterminal.
calculator.SetAsMainNonTerminal(expr);

// expr: term | expr Plus term;
calculator.AddProduction(expr, term.GetDefault());
calculator.AddProduction(expr,
                         expr.GetDefault(),
                         CalculatorToken.Plus.GetDefault(),
                         term.AddCode(
                         (x, r) => { x.Result.Value += r.Value; return x; }
                         ));

// term: factor | term Times factor;
calculator.AddProduction(term, factor.GetDefault());
calculator.AddProduction(term,
                         term.GetDefault(),
                         CalculatorToken.Times.GetDefault(),
                         factor.AddCode
                         (
                         (x, r) => { x.Result.Value *= r.Value; return x; }
                         ));

// factor: LeftParenthesis expr RightParenthesis
//         | Number;
calculator.AddProduction(factor,
                         CalculatorToken.LeftParenthesis.GetDefault(),
                         expr.GetDefault(),
                         CalculatorToken.RightParenthesis.GetDefault());
calculator.AddProduction(factor,
                         CalculatorToken.Number.AddCode
                         (
                         (x, s) => { x.Result = new IntWrapper(int.Parse(s));
                                     return x; }
                         ));

IntWrapper result = calculator.Parse("15+4*10");
// result == 55

(Note that IntWrapper is simply an Int32, except that C# requires it to be a class, hence I had to introduce a wrapper class)

I hope you see that the code above is very powerful: any grammar you can come up with can be parsed. You can add arbitrary bits of code in the grammar capable of performing lots of tasks. If you manage to get this all working, you can re-use the resulting code to do a lot of tasks very easily: just imagine building a command-line interpreter using this piece of code.

Ruby on Rails – Rails Generators for Subscription and Payment Processing

Given the pain you'll experience when you start to navigate the byzantine requirements of PCI compliance, I think "best practices" in this area is not to handle payment processing in the first place.

At least, not within your application.

Stripe is an online payment processor geared toward developers. It's a breeze to accept credit card payments through their API, and it offloads responsibility (and liability) for PCI regulation issues to someone else. Which is a Good Thing.

There's even a Railscasts episode all about using Stripe to handle payments, and the stripe-rails gem will bootstrap the process for you.

Oh, and Stripe handles periodic billing (for subscriptions) for you.

Before you head down this road, I highly recommend taking a look at available technologies you might leverage. Although the code for payment processing is often fairly straightforward, the liability issues are massive.

P.S. I don't work for Stripe or anything. I just think their product/service is excellent, and a good match for your use case.

Best Answer

Related Solutions

Language Design – Parser Generator vs Custom Lexer and Parser

Option 1: parser generators, or 'you need to parse some language and you just want to get it working, dammit'

Option 2: hand-written parsers, or 'you want to build your own parser, and you care about being user-friendly'

Option 3: hand written parser generators, or 'you're trying to learn a lot from this project and you wouldn't mind ending up with a nifty piece of code you can re-use a lot'

Ruby on Rails – Rails Generators for Subscription and Payment Processing

Related Topic