How should I specify a grammar for a parser

grammarlanguage-designparsing

I have been programming for many years, but one task that still takes me inordinately long is to specify a grammar for a parser, and even after this excessive effort, I'm never sure that the grammar I've come up with is good (by any reasonable measure of "good").

I don't expect that there is an algorithm for automating the process of specifying a grammar, but I hope that there are ways to structure the problem that eliminate much of the guesswork and trial-and-error of my current approach.

My first thought has been to read about parsers, and I've done some of this, but everything I've read on this subject takes the grammar as a given (or trivial enough that one can specify it by inspection), and focuses on the problem of translating this grammar into a parser. I'm interested in the problem immediately before: how to specify the grammar in the first place.

I'm primarily interested in the problem of specifying a grammar that formally represents a collection of concrete examples (positive and negative). This is different from the problem of designing a new syntax. Thanks to Macneil for pointing out this distinction.

I had never really appreciated the distinction between a grammar and a syntax, but now that I'm beginning to see it, I could sharpen my first clarification by saying that I'm primarily interested in the problem of specifying a grammar that will enforce a predefined syntax: it just so happens that in my case, the basis for this syntax is usually a collection of positive and negative examples.

How is the grammar specified for a parser? Is there a book or reference out there that's the de-facto standard for describing best practices, design methodologies, and other helpful information about specifying a grammar for a parser? What points, when reading about parser grammar, should I be focusing on?

Best Answer

From the sample files you will need to make decisions based on how much you want to generalize from those examples. Suppose you had the following three samples: (each is a separate file)

f() {}
f(a,b) {b+a}
int x = 5;

You could trivially specify two grammars that will accept these samples:

Trivial Grammar One:

start ::= f() {} | f(a,b) {b+a} | int x = 5;

Trivial Grammar Two:

start ::= tokens
tokens ::= token tokens | <empty>
token ::= identifier | literal | { | } | ( | ) | , | + | = | ;

The first one is trivial because it accepts only the three samples. The second one is trivial because it accepts everything that could possibly use those token types. [For this discussion I'm going to assume that you aren't concerned about the tokenizer design much: It's simple to assume identifiers, numbers, and punctuation as your tokens, and you could borrow any token set from any scripting language you'd like anyway.]

So, the procedure you'll need to follow is to start at the high level and decide "how many of each instance do I want to allow?" If a syntactic construct can make sense to repeat any number of times, such as methods in a class, you will want a rule with this form:

methods ::= method methods | empty

Which is better stated in EBNF as:

methods ::= {method}

It will probably be obvious when you only want zero or one instances (meaning that the construct is optional, as with the extends clause for a Java class), or when you want to allow one or more instances (as with a variable initializer in a declaration). You'll need to be mindful of issues like requiring a separator between elements (as with the , in an argument list), requiring a terminator after each element (as with the ; to separate statements), or requiring no separator or terminator (as the case with methods in a class).

If your language uses arithmetic expressions, it would be easy for you to copy from an existing language's precedence rules. It's best to stick to something well-known, like C's expressions rules, than going for something exotic, but only provided that all else is equal.

In addition to precedence issues (what gets parsed with each other) and repetition issues (how many of each element should occur, how are they separated?), you will also need to think about order: Must something always appear before another thing? If one thing is included, should another be excluded?

At this point, you may be tempted to grammatically enforce some rules, a rule such as if a Person's age is specified you don't want to allow their birthdate to be specified as well. While you can construct your grammar to do so, you may find it easier to enforce this with a "semantic check" pass after everything is parsed. This keeps the grammar simpler and, in my opinion, makes for better error messages for when the rule is violated.

Parsing

Parsing is the process of analyzing of a string of symbols according to a formal grammar. (In Computing Science,) parsing is used to be able to let a computer understand text written in a language, usually creating a parse tree that represents the written text, storing the meaning of the different written parts in each node of the tree. This parse tree can then be used for a variety of different purposes, such as translating it to another language (used in many compilers), interpreting the written instructions directly in some way (SQL, HTML), allowing tools like Linters to do their work, etc. Sometimes, a parse tree is not explicitly generated, but rather the action that should be performed at each type of node in the tree is executed directly. This increases efficiency, but underwater still an implicit parse tree exists.

Parsing is a problem that is computationally difficult. There has been over fifty years of research on this subject, but there is still much to learn.

Roughly speaking, there are four general algorithms to let a computer parse input:

LL parsing. (Context-free, top-down parsing.)
LR parsing. (Context-free, bottom-up parsing.)
PEG + Packrat parsing.
Earley Parsing.

Note that these types of parsing are very general, theoretical descriptions. There are multiple ways to implement each of these algorithms on physical machines, with different tradeoffs.

LL and LR can only look at Context-Free grammars (that is; the context around the tokens that are written is not important to understand how they are used).

PEG/Packrat parsing and Earley parsing are used a lot less: Earley-parsing is nice in that it can handle a whole lot more grammars (including those that are not necessarily Context-Free) but it is less efficient (as claimed by the dragon book (section 4.1.1); I am not sure if these claims are still accurate). Parsing Expression Grammar + Packrat-parsing is a method that is relatively efficient and can also handle more grammars than both LL and LR, but hides ambiguities, as will quickly be touched on below.

LL (Left-to-right, Leftmost derivation)

This is possibly the most natural way to think about parsing. The idea is to look at the next token in the input string and then decide which one of maybe multiple possible recursive calls should be taken to generate a tree structure.

This tree is built 'top-down', meaning that we start at the root of the tree, and travel the grammar rules in the same way as we travel through the input string. It can also be seen as constructing a 'postfix' equivalent for the 'infix' token stream that is being read.

Parsers performing LL-style parsing can be written to look very much like the original grammar that was specified. This makes it relatively easy to understand, debug and enhance them. Classical Parser Combinators are nothing more than 'lego pieces' that can be put together to build an LL-style parser.

LR (Left-to-right, Rightmost derivation)

LR parsing travels the other way, bottom-up: At each step, the top element(s) on the stack are compared to the list of grammar, to see if they could be reduced to a higher-level rule in the grammar. If not, the next token from the input stream is shift ed and placed on top of the stack.

A program is correct if at the end we end up with a single node on the stack which represents the starting rule from our grammar.

Lookahead

In either of these two systems, it sometimes is necessary to peek at more tokens from the input before being able to decide which choice to make. This is the (0), (1), (k) or (*)-syntax you see after the names of these two general algorithms, such as LR(1) or LL(k). k usually stands for 'as much as your grammar needs', while * usually stands for 'this parser performs backtracking', which is more powerful/easy to implement, but has a much higher memory and time usage than a parser that can just keep on parsing linearly.

Note that LR-style parsers already have many tokens on the stack when they might decide to 'look ahead', so they already have more information to dispatch on. This means that they often need less 'lookahead' than an LL-style parser for the same grammar.

LL vs. LR: Ambiguity

When reading the two descriptions above, one might wonder why LR-style parsing exists, as LL-style parsing seems a lot more natural.

However, LL-style parsing has a problem: Left Recursion.

It is very natural to write a grammar like:

expr ::= expr '+' expr | term
term ::= integer | float

But, a LL-style parser will get stuck in an infinite recursive loop when parsing this grammar: When trying out the left-most possibility of the expr rule, it recurses to this rule again without consuming any input.

There are ways to resolve this problem. The simplest is to rewrite your grammar so that this kind of recursion does not happen any more:

expr ::= term expr_rest
expr_rest ::= '+' expr | ϵ
term ::= integer | float

(Here, ϵ stands for the 'empty string')

This grammar now is right recursive. Note that it immediately is a lot more difficult to read.

In practice, left-recursion might happen indirectly with many other steps in-between. This makes it a hard problem to look out for. But trying to solve it makes your grammar harder to read.

As Section 2.5 of the Dragon Book states:

We appear to have a conflict: on the one hand we need a grammar that facilitates translation, on the other hand we need a significantly different grammar that facilitates parsing. The solution is to begin with the grammar for easy translation and carefully transform it to facilitate parsing. By eliminating the left recursion we can obtain a grammar suitable for use in a predictive recursive-descent translator.

LR-style parsers do not have the problem of this left-recursion, as they build the tree from the bottom-up. However, the mental translation of a grammar like above to an LR-style parser (which is often implemented as a Finite-State Automaton)
is very hard (and error-prone) to do, as often there are hundreds or thousands of states + state transitions to consider. This is why LR-style parsers are usually generated by a Parser Generator, which is also known as a 'compiler compiler'.

How to resolve Ambiguities

We saw two methods to resolve Left-recursion ambiguities above: 1) rewrite the syntax 2) use an LR-parser.

But there are other kinds of ambiguities which are harder to solve: What if two different rules are equally applicable at the same time?

Some common examples are:

arithmetic expressions.
the Dangling Else

Both LL-style and LR-style parsers have problems with these. Problems with parsing arithmetic expressions can be solved by introducing operator precedence. In a similar way, other problems like the Dangling Else can be solved, by picking one precedence behaviour and sticking with it. (In C/C++, for instance, the dangling else always belongs to the closest 'if').

Another 'solution' to this is to use Parser Expression Grammar (PEG): This is similar to the BNF-grammar used above, but in the case of an ambiguity, always 'pick the first'. Of course, this does not really 'solve' the problem, but rather hide that an ambiguity actually exists: The end users might not know which choice the parser makes, and this might lead to unexpected results.

More information that is a whole lot more in-depth than this post, including why it is impossible in general to know if your grammar does not have any ambiguities and the implications of this is the wonderful blog article LL and LR in context: Why parsing tools are hard. I can highly recommend it; it helped me a lot to understand all the things I am talking about right now.

50 years of research

But life goes on. It turned out that 'normal' LR-style parsers implemented as finite state automatons often needed thousands of states + transitions, which was a problem in program size. So, variants such as Simple LR (SLR) and LALR (Look-ahead LR) were written that combine other techniques to make the automaton smaller, reducing the disk and memory footprint of the parser programs.

Also, another way to resolve the ambiguities listed above is to use generalized techniques in which, in the case of an ambiguity, both possibilities are kept and parsed: Either one might fail to parse down the line (in which case the other possibility is the 'correct' one), as well as returning both (and in this way showing that an ambiguity exists) in the case they both are correct.

Interestingly, after the Generalized LR algorithm was described, it turned out that a similar approach could be used to implement Generalized LL parsers, which is similarly fast ( $O(n^3)$ time complexity for ambiguous grammars, $ O(n) $ for completely unambiguous grammars, albeit with more bookkeeping than a simple (LA)LR parser, which means a higher constant-factor) but again allow a parser to be written in recursive descent (top-down) style that is a lot more natural to write and debug.

Parser Combinators, Parser Generators

So, with this long exposition, we are now arriving at the core of the question:

What is the difference of Parser Combinators and Parser Generators, and when should one be used over the other?

They are really different kinds of beasts:

Parser Combinators were created because people were writing top-down parsers and realized that many of these had a lot in common.

Parser Generators were created because people were looking to build parsers that did not have the problems that LL-style parsers had (i.e. LR-style parsers), which proved very hard to do by hand. Common ones include Yacc/Bison, that implement (LA)LR).

Interestingly, nowadays the landscape is muddled somewhat:

It is possible to write Parser Combinators that work with the GLL algorithm, resolving the ambiguity-issues that classical LL-style parsers had, while being just as readable/understandable as all kinds of top-down parsing.
Parser Generators can also be written for LL-style parsers. ANTLR does exactly that, and uses other heuristics (Adaptive LL(*)) to resolve the ambiguities that classical LL-style parsers had.

In general, creating an LR parser generator and and debugging the output of an (LA)LR-style parser generator running on your grammar is difficult, because of the translation of your original grammar to the 'inside-out' LR form. On the other hand, tools like Yacc/Bison have had many years of optimisations, and seen a lot of use in the wild, which means that many people now consider it as the way to do parsing and are sceptical towards new approaches.

Which one you should use, depends on how hard your grammar is, and how fast the parser needs to be. Depending on the grammar, one of these techniques (/implementations of the different techniques) might be faster, have a smaller memory footprint, have a smaller disk footprint, or be more extensible or easier to debug than the others. Your Mileage May Vary.

Side note: On the subject of Lexical Analysis.

Lexical Analysis can be used both for Parser Combinators and Parser Generators. The idea is to have a 'dumb' parser that is very easy to implement (and therefore fast) that performs a first pass over your source code, removing for instance repeating white spaces, comments, etc, and possibly 'tokenizing' in a very coarse way the different elements that make up your language.

The main advantage is that this first step makes the real parser a lot simpler (and because of that possibly faster). The main disadvantage is that you have a separate translation step, and e.g. error reporting with line- and column numbers becomes harder because of the removal of white-space.

A lexer in the end is 'just' another parser and can be implemented using any of the techniques above. Because of its simplicity, often other techniques are used than for the main parser, and for instance extra 'lexer generators' exist.

Tl;Dr:

Here is a flowchart that is applicable to most cases: