How to Get Lookahead Symbol When Constructing LR(1) NFA for Parser

parsing

I am reading an explanation of how to construct an LR(1) parser (in the awesome "Parsing Techniques" by D. Grune and C.J.H. Jacobs, p. 292 in the 2nd edition), and I am at the stage of building the initial NFA. What I don't understand is how to compute a lookahead symbol.

Here is the example from the book, the grammar:

S -> E
E -> E - T
E -> T
T -> ( E )
T -> n

n is a terminal. The transitions that look "weird" to me are in this sequence:

1)   S -> . E        eof
2)   E -> . E - T    eof
3)   E -> . E - T    -
4)   E -> E . - T    -
5)   E -> E - . T    -

(Note: In the above table, the state numbers are in front and the lookahead symbol is at the end.)

What puzzles me is that the transition from (4) to (5) means reading the - token, right? So how is it that - is still a lookahead symbol and, even more important, why is eof no longer a lookahead symbol? After all, in an input such as n - n eof there is only one - symbol.

My naive thinking tells me (5) should be written as:

5)   E -> E - . T    - eof

And another thing: n is a terminal. Why is it not used as a lookahead symbol at all? I mean, we expect to see - or (, which is fine, but does the lack of n mean we are sure it won't appear in the input?

Update: after more reading I am only more confused 😉 What really is a lookahead? I ask because I see a state such as this one (p. 292, 2nd column, 2nd row):

E -> E . - T      eof

The lookahead says eof, but the incoming input says -. Isn't that a contradiction? And it is not only in this book.

Best Answer

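A lookahead token is a token, not necessarily a single character: either a terminal of the grammar or the eof marker. The lookahead attached to an item is not the next token of the input. It is a token that may follow the item's whole right-hand side once that side has been recognized and reduced, which is why it is always in the FOLLOW set of the item's left-hand side. So E -> E . - T with lookahead eof is not a contradiction: the next input token is -, and eof is what is expected after the complete E - T. Moving the dot over a symbol, as in your step from (4) to (5), leaves the lookahead unchanged. A new lookahead is computed only on the transition from an item A -> α . B β with lookahead a to an item B -> . γ, where it is any token in FIRST(β a). In S -> . E nothing follows E, so the lookahead stays eof; in E -> . E - T the inner E is followed by -, which gives your item (3). The terminal n never shows up as a lookahead because in this grammar no nonterminal can be followed directly by n. This is the nature of LR parsing: being bottom-up, it recognizes a whole right-hand side with all its inner unfoldings and only then uses the lookahead to decide whether to reduce.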
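To make the FIRST-based rule concrete, here is a minimal Python sketch that computes item lookaheads for the grammar from the question. It uses the usual textbook closure rule and works on item sets rather than on the book's NFA stations, so treat it as an illustration and not as the book's exact construction. The item layout `(lhs, rhs, dot, lookahead)` and the names `first_sets`, `first_of`, `closure` and `goto_state` are assumptions made for this example. It also relies on the grammar having no empty rules.

```python
# Sketch only: the item layout and function names are assumptions for illustration.
GRAMMAR = [
    ("S", ("E",)),
    ("E", ("E", "-", "T")),
    ("E", ("T",)),
    ("T", ("(", "E", ")")),
    ("T", ("n",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def first_sets():
    """FIRST sets by fixpoint iteration (valid here because no rule is empty)."""
    first = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            head = rhs[0]
            new = first[head] if head in NONTERMINALS else {head}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


FIRST = first_sets()


def first_of(symbols, lookahead):
    """FIRST(beta a): tokens that can start beta, or a itself if beta is empty."""
    if not symbols:
        return {lookahead}
    head = symbols[0]
    return set(FIRST[head]) if head in NONTERMINALS else {head}


def closure(items):
    """Add [B -> . gamma, b] for each [A -> alpha . B beta, a], b in FIRST(beta a)."""
    result = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for b in first_of(rhs[dot + 1:], la):
                for lhs2, rhs2 in GRAMMAR:
                    item = (lhs2, rhs2, 0, b)
                    if lhs2 == rhs[dot] and item not in result:
                        result.add(item)
                        work.append(item)
    return result


def goto_state(items, symbol):
    """Move the dot over `symbol`; the lookahead is copied unchanged."""
    moved = {(l, r, d + 1, la) for l, r, d, la in items
             if d < len(r) and r[d] == symbol}
    return closure(moved)


start = closure({("S", ("E",), 0, "eof")})
after_e = goto_state(start, "E")
after_minus = goto_state(after_e, "-")

# Print the items of the state reached after reading "E" and then "-".
for lhs, rhs, dot, la in sorted(after_minus):
    print(lhs, "->", " ".join(rhs[:dot] + (".",) + rhs[dot:]), "  ", la)
```

In this sketch the only lookaheads in the start state are eof and -, and n never appears as one. The state reached after E and then - contains E -> E - . T twice, once with eof and once with -. That is what the "- eof" shorthand from the question amounts to once individual items are grouped into a single state.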

Related Solutions

Unit tests for a csv parser

I just found https://github.com/maxogden/csv-spectrum:

A bunch of different CSV files to serve as an acid test for CSV parsing libraries. There are also JSON versions of the CSVs for verification purposes.

The goal of this repository is to capture test cases to represent the entire CSV spectrum.

How should I specify a grammar for a parser

From the sample files you will need to make decisions based on how much you want to generalize from those examples. Suppose you had the following three samples: (each is a separate file)

f() {}
f(a,b) {b+a}
int x = 5;

You could trivially specify two grammars that will accept these samples:

Trivial Grammar One:

start ::= f() {} | f(a,b) {b+a} | int x = 5;

Trivial Grammar Two:

start ::= tokens
tokens ::= token tokens | <empty>
token ::= identifier | literal | { | } | ( | ) | , | + | = | ;

The first one is trivial because it accepts only the three samples. The second one is trivial because it accepts everything that could possibly use those token types. [For this discussion I'm going to assume that you aren't concerned about the tokenizer design much: It's simple to assume identifiers, numbers, and punctuation as your tokens, and you could borrow any token set from any scripting language you'd like anyway.]

So, the procedure you'll need to follow is to start at the high level and decide "how many of each instance do I want to allow?" If a syntactic construct can make sense to repeat any number of times, such as methods in a class, you will want a rule with this form:

methods ::= method methods | empty

Which is better stated in EBNF as:

methods ::= {method}
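As a rough illustration of how the EBNF repetition maps onto code, here is a minimal hand-written parser sketch in Python. The shape of `method` used here (an identifier followed by `( ) { }`) and the function names are hypothetical, chosen only to keep the example short. The `{method}` repetition becomes a plain `while` loop.

```python
# Sketch only: the shape of `method` and all names are assumptions for illustration.

def parse_method(tokens, pos):
    """method ::= identifier '(' ')' '{' '}'   (hypothetical rule)"""
    name = tokens[pos]
    if tokens[pos + 1:pos + 5] != ["(", ")", "{", "}"]:
        raise SyntaxError(f"malformed method '{name}' at token {pos}")
    return name, pos + 5


def parse_methods(tokens, pos=0):
    """methods ::= {method}   (zero or more, no separator or terminator)"""
    methods = []
    # Keep going for as long as the next token can start a method.
    while pos < len(tokens) and tokens[pos].isidentifier():
        name, pos = parse_method(tokens, pos)
        methods.append(name)
    return methods, pos


print(parse_methods("f ( ) { } g ( ) { }".split()))
```

The loop condition plays the role of the "is there another instance?" decision that the recursive rule methods ::= method methods | empty expresses with its two alternatives.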

It will probably be obvious when you only want zero or one instances (meaning that the construct is optional, as with the extends clause for a Java class), or when you want to allow one or more instances (as with a variable initializer in a declaration). You'll need to be mindful of issues like requiring a separator between elements (as with the , in an argument list), requiring a terminator after each element (as with the ; that ends each statement), or requiring no separator or terminator (as is the case with methods in a class).
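The difference between a separator and a terminator is easy to see in code. The following Python sketch works on a list of tokens that is assumed to hold nothing but the list being parsed. The function names and the single-token items are assumptions made for brevity.

```python
# Sketch only: function names and single-token items are assumptions for illustration.

def parse_separated(tokens, sep=","):
    """items ::= [item {sep item}]   (separator only between elements)"""
    if not tokens:
        return []
    if tokens[0] == sep:
        raise SyntaxError("list may not start with a separator")
    items = [tokens[0]]
    pos = 1
    while pos < len(tokens):
        # Every further element must be introduced by exactly one separator.
        if tokens[pos] != sep or pos + 1 >= len(tokens) or tokens[pos + 1] == sep:
            raise SyntaxError(f"expected '{sep}' followed by an item at token {pos}")
        items.append(tokens[pos + 1])
        pos += 2
    return items


def parse_terminated(tokens, term=";"):
    """items ::= {item term}   (terminator required after every element)"""
    items = []
    pos = 0
    while pos < len(tokens):
        if tokens[pos] == term or pos + 1 >= len(tokens) or tokens[pos + 1] != term:
            raise SyntaxError(f"expected an item followed by '{term}' at token {pos}")
        items.append(tokens[pos])
        pos += 2
    return items


print(parse_separated(["a", ",", "b"]))
print(parse_terminated(["x", ";", "y", ";"]))
```

With a separator a trailing `,` is an error, while with a terminator a missing final `;` is an error. Which of the two you pick changes what your users are allowed to write at the end of a list.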

If your language uses arithmetic expressions, it would be easy for you to copy an existing language's precedence rules. All else being equal, it's better to stick to something well-known, like C's expression rules, than to go for something exotic.

In addition to precedence issues (what gets parsed with each other) and repetition issues (how many of each element should occur, how are they separated?), you will also need to think about order: Must something always appear before another thing? If one thing is included, should another be excluded?

At this point, you may be tempted to enforce some rules grammatically, such as a rule that if a Person's age is specified, their birthdate may not be specified as well. While you can construct your grammar to do so, you may find it easier to enforce this with a "semantic check" pass after everything is parsed. This keeps the grammar simpler and, in my opinion, makes for better error messages when the rule is violated.
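A semantic check of that kind can be very small. The sketch below assumes the parser has already produced a Person as a plain Python dict. The field names `age` and `birthdate`, the function name `check_person`, and the wording of the message are all hypothetical.

```python
# Sketch only: the dict representation and field names are assumptions for illustration.

def check_person(person):
    """Return a list of error messages for rules the grammar does not enforce."""
    errors = []
    if "age" in person and "birthdate" in person:
        errors.append("Person may specify 'age' or 'birthdate', but not both")
    return errors


print(check_person({"name": "Ann", "age": 30, "birthdate": "1994-02-01"}))
print(check_person({"name": "Bob", "age": 41}))
```

Because the check runs after parsing, the message can name the conflicting fields directly, which is harder to achieve when the same restriction is buried in the grammar and surfaces as a generic syntax error.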
