How should I specify a grammar for a parser

grammarlanguage-designparsing

I have been programming for many years, but one task that still takes me inordinately long is to specify a grammar for a parser, and even after this excessive effort, I'm never sure that the grammar I've come up with is good (by any reasonable measure of "good").

I don't expect that there is an algorithm for automating the process of specifying a grammar, but I hope that there are ways to structure the problem that eliminate much of the guesswork and trial-and-error of my current approach.

My first thought has been to read about parsers, and I've done some of this, but everything I've read on this subject takes the grammar as a given (or trivial enough that one can specify it by inspection), and focuses on the problem of translating this grammar into a parser. I'm interested in the problem immediately before: how to specify the grammar in the first place.

I'm primarily interested in the problem of specifying a grammar that formally represents a collection of concrete examples (positive and negative). This is different from the problem of designing a new syntax. Thanks to Macneil for pointing out this distinction.

I had never really appreciated the distinction between a grammar and a syntax, but now that I'm beginning to see it, I could sharpen my first clarification by saying that I'm primarily interested in the problem of specifying a grammar that will enforce a predefined syntax: it just so happens that in my case, the basis for this syntax is usually a collection of positive and negative examples.

How is the grammar specified for a parser? Is there a book or reference out there that's the de-facto standard for describing best practices, design methodologies, and other helpful information about specifying a grammar for a parser? What points, when reading about parser grammar, should I be focusing on?

Best Answer

From the sample files you will need to make decisions based on how much you want to generalize from those examples. Suppose you had the following three samples: (each is a separate file)

f() {}
f(a,b) {b+a}
int x = 5;

You could trivially specify two grammars that will accept these samples:

Trivial Grammar One:

start ::= f() {} | f(a,b) {b+a} | int x = 5;

Trivial Grammar Two:

start ::= tokens
tokens ::= token tokens | <empty>
token ::= identifier | literal | { | } | ( | ) | , | + | = | ;

The first one is trivial because it accepts only the three samples. The second one is trivial because it accepts everything that could possibly use those token types. [For this discussion I'm going to assume that you aren't concerned about the tokenizer design much: It's simple to assume identifiers, numbers, and punctuation as your tokens, and you could borrow any token set from any scripting language you'd like anyway.]

So, the procedure you'll need to follow is to start at the high level and decide "how many of each instance do I want to allow?" If a syntactic construct can make sense to repeat any number of times, such as methods in a class, you will want a rule with this form:

methods ::= method methods | empty

Which is better stated in EBNF as:

methods ::= {method}

It will probably be obvious when you only want zero or one instances (meaning that the construct is optional, as with the extends clause for a Java class), or when you want to allow one or more instances (as with a variable initializer in a declaration). You'll need to be mindful of issues like requiring a separator between elements (as with the , in an argument list), requiring a terminator after each element (as with the ; to separate statements), or requiring no separator or terminator (as the case with methods in a class).

If your language uses arithmetic expressions, it would be easy for you to copy from an existing language's precedence rules. It's best to stick to something well-known, like C's expressions rules, than going for something exotic, but only provided that all else is equal.

In addition to precedence issues (what gets parsed with each other) and repetition issues (how many of each element should occur, how are they separated?), you will also need to think about order: Must something always appear before another thing? If one thing is included, should another be excluded?

At this point, you may be tempted to grammatically enforce some rules, a rule such as if a Person's age is specified you don't want to allow their birthdate to be specified as well. While you can construct your grammar to do so, you may find it easier to enforce this with a "semantic check" pass after everything is parsed. This keeps the grammar simpler and, in my opinion, makes for better error messages for when the rule is violated.

Related Topic