How to Get Lookahead Symbol When Constructing LR(1) NFA for Parser

parsing

I am reading an explanation (awesome "Parsing Techniques" by D.Grune and C.J.H.Jacobs; p.292 in the 2nd edition) about how to construct an LR(1) parser, and I am at the stage of building the initial NFA. What I don't understand is how to get/compute a lookahead symbol.

Here is the example from the book, the grammar:

S -> E
E -> E - T
E -> T
T -> ( E )
T -> n

n is terminal. The "weird" transitions for me are is the sequence:

1)   S -> . E        eof
2)   E -> . E - T    eof
3)   E -> . E - T    -
4)   E -> E . - T    -
5)   E -> E - . T    -

(Note: In the above table, the state numbers are in front and the lookahead symbol is at the end.)

What puzzles me is that transition from (4) to (5) means reading - token, right? So how is it that - is still a lookahead symbol and even more important why is it that eof is no longer a lookahead symbol? After all in an input such as n - n eof there is only one - symbol.

My naive thinking tells me (5) should be written as:

5)   E -> E - . T    - eof

And another thing — n is terminal. Why it is not used at all as a lookahead symbol? I mean — we expect to see - or (, it is ok, but lack of n means we are sure it won't appear in input?

Update: after more reading I am only more confused 😉 I.e. what is really a lookahead? Because I see such state as (p.292, 2nd column, 2nd row):

E -> E . - T      eof

Lookahead says eof but the incoming input says -. Isn't it a contradiction? And it is not only in this book.

Best Answer

A lookahead token is a character (or sequence of characters, it's a token after all) defined as either one of the terminals or those tokens which are in the FOLLOW set. Look at the possible transitions from Follow to the next FIRST and consider that the next token is possibly the eof token because of the nature of LR parsing. (it considers the whole next token and its inner unfoldings. hence bottom-up parsing.)