Performance of parsers: PEG vs LALR(1) or LL(k)

lalrllparser-generatorparsingpeg

I've seen some claims that optimized PEG parsers in general cannot be faster than optimized LALR(1) or LL(k) parsers. (Of course, performance of parsing would depend on a particular grammar.)

I'd like to know if there are any specific limitations of PEG parsers, either valid in general or for some subsets of PEG grammars that would make them inferior to LALR(1) or
LL(k) performance-wise.

In particular, I'm interested in parser generators, but assume that their output can be tweaked for performance in any particular case. I also assume that parsers are optimized and it is possible to tweak a particular grammar a bit if that's needed to improve performance.

Best Answer

Found a good answer about Packrat vs LALR parsing. Some quotes from it:

L(AL)R parsers are linear time parsers, too. So in theory, neither packrat nor L(AL)R parsers are "faster".

What matters, in practice, of course, is implementation. L(AL)R state transitions can be executed in very few machine instructions ("look token code up in vector, get next state and action") so they can be extremely fast in practice.

An observation: most language front-ends don't spend most of their time "parsing"; rather, they spend a lot of time in lexical analysis. Optimize that ..., and the parser speed won't matter much.

Related Solutions

C++ – Why can’t C++ be parsed with a LR(1) parser

LR parsers can't handle ambiguous grammar rules, by design. (Made the theory easier back in the 1970s when the ideas were being worked out).

C and C++ both allow the following statement:

x * y ;

It has two different parses:

It can be the declaration of y, as pointer to type x
It can be a multiply of x and y, throwing away the answer.

Now, you might think the latter is stupid and should be ignored. Most would agree with you; however, there are cases where it might have a side effect (e.g., if multiply is overloaded). but that isn't the point. The point is there are two different parses, and therefore a program can mean different things depending on how this should have been parsed.

The compiler must accept the appropriate one under the appropriate circumstances, and in the absence of any other information (e.g., knowledge of the type of x) must collect both in order to decide later what to do. Thus a grammar must allow this. And that makes the grammar ambiguous.

Thus pure LR parsing can't handle this. Nor can many other widely available parser generators, such as Antlr, JavaCC, YACC, or traditional Bison, or even PEG-style parsers, used in a "pure" way.

There are lots of more complicated cases (parsing template syntax requires arbitrary lookahead, whereas LALR(k) can look ahead at most k tokens), but only it only takes one counterexample to shoot down pure LR (or the others) parsing.

Most real C/C++ parsers handle this example by using some kind of deterministic parser with an extra hack: they intertwine parsing with symbol table collection... so that by the time "x" is encountered, the parser knows if x is a type or not, and can thus choose between the two potential parses. But a parser that does this isn't context free, and LR parsers (the pure ones, etc.) are (at best) context free.

One can cheat, and add per-rule reduction-time semantic checks in the to LR parsers to do this disambiguation. (This code often isn't simple). Most of the other parser types have some means to add semantic checks at various points in the parsing, that can be used to do this.

And if you cheat enough, you can make LR parsers work for C and C++. The GCC guys did for awhile, but gave it up for hand-coded parsing, I think because they wanted better error diagnostics.

There's another approach, though, which is nice and clean and parses C and C++ just fine without any symbol table hackery: GLR parsers. These are full context free parsers (having effectively infinite lookahead). GLR parsers simply accept both parses, producing a "tree" (actually a directed acyclic graph that is mostly tree like) that represents the ambiguous parse. A post-parsing pass can resolve the ambiguities.

We use this technique in the C and C++ front ends for our DMS Software Reengineering Tookit (as of June 2017 these handle full C++17 in MS and GNU dialects). They have been used to process millions of lines of large C and C++ systems, with complete, precise parses producing ASTs with complete details of the source code. (See the AST for C++'s most vexing parse.)

Best Answer

Related Solutions

C++ – Why can’t C++ be parsed with a LR(1) parser

Related Topic