I am reading an explanation (awesome "Parsing Techniques" by D.Grune and C.J.H.Jacobs; p.292 in the 2nd edition) about how to construct an LR(1) parser, and I am at the stage of building the initial NFA. What I don't understand is how to get/compute a lookahead symbol.
Here is the example from the book, the grammar:
S -> E
E -> E - T
E -> T
T -> ( E )
T -> n
n
is terminal. The "weird" transitions for me are is the sequence:
1) S -> . E eof
2) E -> . E - T eof
3) E -> . E - T -
4) E -> E . - T -
5) E -> E - . T -
(Note: In the above table, the state numbers are in front and the lookahead symbol is at the end.)
What puzzles me is that transition from (4) to (5) means reading -
token, right? So how is it that -
is still a lookahead symbol and even more important why is it that eof
is no longer a lookahead symbol? After all in an input such as n - n eof
there is only one -
symbol.
My naive thinking tells me (5) should be written as:
5) E -> E - . T - eof
And another thing — n
is terminal. Why it is not used at all as a lookahead symbol? I mean — we expect to see -
or (
, it is ok, but lack of n
means we are sure it won't appear in input?
Update: after more reading I am only more confused 😉 I.e. what is really a lookahead? Because I see such state as (p.292, 2nd column, 2nd row):
E -> E . - T eof
Lookahead says eof
but the incoming input says -
. Isn't it a contradiction? And it is not only in this book.
Best Answer
A lookahead token is a character (or sequence of characters, it's a token after all) defined as either one of the terminals or those tokens which are in the FOLLOW set. Look at the possible transitions from Follow to the next FIRST and consider that the next token is possibly the eof token because of the nature of LR parsing. (it considers the whole next token and its inner unfoldings. hence bottom-up parsing.)