Parsing – How to Add Precedence to LALR Parser Like in YACC


Please note, I am asking about writing an LALR parser, not about writing rules for an LALR parser.

What I need is…

…to mimic YACC precedence definitions. I don't know how they are implemented, so below I describe what I've done and read so far.

So far I have a basic LALR parser written. The next step is adding precedence, so that 2+3*4 is parsed as 2+(3*4).

I've read about precedence parsers, but I don't see how to fit that model into LALR. There are two points I don't understand:

how to determine when to insert a parenthesis generator
how to determine how many parentheses the generator should create

I insert generators when a symbol is taken from the input and pushed onto the stack, right? So let's say I have something like this (| denotes the boundary between stack and input):

ID = 5 | + ..., at this point I add open, so it gives
ID = < 5 | + ..., then I read more input
ID = < 5 + | 5 ... and more
ID = < 5 + 5 | ; ... and more
ID = < 5 + 5 ; | ...

At this point a normal LALR parser would perform several reduce moves, but the open parenthesis has no match yet, so I would have to keep reading input, which does not make sense.

So that was the "when" problem.

As for the count, let's say I have data such as < 2 + < 3 * 4 >. As a human I can see that the last generator should create 2 parentheses, but how do I compute this? After all, there could be two scenarios:

( 2 + ( 3 * 4 )), where the parentheses show the outcome of the generators
or ( 2 + (( 3 * 4 ) ^ 5 )), because there was more input

Please note that in both cases there was an open generator before 3 and a close generator after 4. However, in both cases I have to reduce after reading 4, so I have to know what the generator "creates".

Best Answer

Consider the following rules:

[1] E -> E + E
[2] E -> E * E
[3] E -> num

Without precedence, this grammar produces shift/reduce conflicts: after E + E or E * E has been parsed and another operator is the lookahead, the parser can either reduce or shift. The table looks something like this:

    num     +       *       $       E
0   S2      ---     ---     ---     1
1   ---     S3      S4      Accept  ---
2   ---     R3      R3      R3      ---
3   S2      ---     ---     ---     5
4   S2      ---     ---     ---     6
5   ---     R1/S3   R1/S4   R1      ---
6   ---     R2/S3   R2/S4   R2      ---

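Precedence tells the parser generator which of "shift" and "reduce" to keep when such a conflict occurs. Compare the first operator (the one in the rule that could be reduced, already on the stack) with the second operator (the lookahead token):

If the first operator has a lower precedence, you shift.
If the first operator has a higher precedence, you reduce.
If the operators have equal precedences, and both are
- left associative, you reduce.
- right associative, you shift.
- otherwise (non-associative), you fail: the table entry becomes a syntax error.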
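
The rules above can be sketched as a small function that a table generator would call whenever it finds a shift/reduce conflict. This is only an illustration: the PRECEDENCE table, its levels, and the name resolve are assumptions made for the example, not part of YACC. Operators with equal precedence are assumed to share one associativity, as they do when declared on the same line in YACC.

```python
# Hypothetical precedence table: operator -> (level, associativity).
# A higher level binds tighter. The entries are illustrative only.
PRECEDENCE = {
    "<": (0, "nonassoc"),
    "+": (1, "left"),
    "*": (2, "left"),
    "^": (3, "right"),
}


def resolve(rule_op, lookahead):
    """Resolve a shift/reduce conflict.

    rule_op   -- operator of the rule that could be reduced (on the stack)
    lookahead -- operator token that could be shifted
    Returns "shift", "reduce", or "error".
    """
    rule_level, _ = PRECEDENCE[rule_op]
    look_level, assoc = PRECEDENCE[lookahead]
    if rule_level < look_level:
        return "shift"      # lookahead binds tighter: keep reading
    if rule_level > look_level:
        return "reduce"     # rule binds tighter: finish it first
    # Equal precedence: associativity decides.
    if assoc == "left":
        return "reduce"
    if assoc == "right":
        return "shift"
    return "error"          # non-associative operators cannot be chained


if __name__ == "__main__":
    for pair in [("+", "*"), ("*", "+"), ("+", "+"), ("^", "^"), ("<", "<")]:
        print(pair, "->", resolve(*pair))
```

The generator runs this once per conflicting table cell, so the finished parser never looks at precedence at run time.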

Resolving the conflicts in the table, with both operators left associative and * binding tighter than +, you get something like this:

    num     +       *       $       E
0   S2      ---     ---     ---     1
1   ---     S3      S4      Accept  ---
2   ---     R3      R3      R3      ---
3   S2      ---     ---     ---     5
4   S2      ---     ---     ---     6
5   ---     R1      S4      R1      ---
6   ---     R2      R2      R2      ---
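
To see that the resolved table is all the parser needs, here is a minimal table-driven sketch in Python. The ACTION and GOTO dictionaries are transcribed from the table above; the driver loop, the tuple-based tree, and the name parse are assumptions made for this example. No parenthesis generators appear anywhere: the grouping falls out of the order in which reductions happen.

```python
# ACTION and GOTO transcribed from the conflict-free table above.
# ("s", n) = shift to state n, ("r", n) = reduce by rule n.
ACTION = {
    0: {"num": ("s", 2)},
    1: {"+": ("s", 3), "*": ("s", 4), "$": ("acc", None)},
    2: {"+": ("r", 3), "*": ("r", 3), "$": ("r", 3)},
    3: {"num": ("s", 2)},
    4: {"num": ("s", 2)},
    5: {"+": ("r", 1), "*": ("s", 4), "$": ("r", 1)},
    6: {"+": ("r", 2), "*": ("r", 2), "$": ("r", 2)},
}
GOTO = {0: 1, 3: 5, 4: 6}          # E is the only nonterminal
RULE_LEN = {1: 3, 2: 3, 3: 1}      # number of symbols on each right-hand side


def parse(tokens):
    """Parse a list such as [2, '+', 3, '*', 4] into a nested tuple tree."""
    stream = [("num", t) if isinstance(t, int) else (t, t) for t in tokens]
    stream.append(("$", None))
    states, values, pos = [0], [], 0
    while True:
        kind, val = stream[pos]
        action = ACTION[states[-1]].get(kind)
        if action is None:
            raise SyntaxError(f"unexpected {kind!r} at position {pos}")
        op, arg = action
        if op == "s":
            states.append(arg)
            values.append(val)
            pos += 1
        elif op == "r":
            n = RULE_LEN[arg]
            args = values[-n:]
            del values[-n:]
            del states[-n:]
            # Rule 3 passes the number through; rules 1 and 2 build a node.
            values.append(args[0] if arg == 3 else (args[1], args[0], args[2]))
            states.append(GOTO[states[-1]])
        else:
            return values[0]


if __name__ == "__main__":
    print(parse([2, "+", 3, "*", 4]))   # ('+', 2, ('*', 3, 4))
    print(parse([2, "*", 3, "+", 4]))   # ('+', ('*', 2, 3), 4)
```

In state 5 (E + E on the stack) the lookahead * is shifted, so 3 * 4 is reduced first; in state 6 (E * E on the stack) the lookahead + triggers a reduce, so the product is grouped before the sum.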

Related Solutions

How should I specify a grammar for a parser

From the sample files you will need to make decisions based on how much you want to generalize from those examples. Suppose you had the following three samples: (each is a separate file)

f() {}
f(a,b) {b+a}
int x = 5;

You could trivially specify two grammars that will accept these samples:

Trivial Grammar One:

start ::= f() {} | f(a,b) {b+a} | int x = 5;

Trivial Grammar Two:

start ::= tokens
tokens ::= token tokens | <empty>
token ::= identifier | literal | { | } | ( | ) | , | + | = | ;

The first one is trivial because it accepts only the three samples. The second one is trivial because it accepts everything that could possibly use those token types. [For this discussion I'm going to assume that you aren't concerned about the tokenizer design much: It's simple to assume identifiers, numbers, and punctuation as your tokens, and you could borrow any token set from any scripting language you'd like anyway.]

So, the procedure you'll need to follow is to start at the high level and decide "how many of each instance do I want to allow?" If a syntactic construct can make sense to repeat any number of times, such as methods in a class, you will want a rule with this form:

methods ::= method methods | empty

Which is better stated in EBNF as:

methods ::= {method}

It will probably be obvious when you only want zero or one instances (meaning that the construct is optional, as with the extends clause for a Java class), or when you want to allow one or more instances (as with a variable initializer in a declaration). You'll need to be mindful of issues like requiring a separator between elements (as with the , in an argument list), requiring a terminator after each element (as with the ; that ends each statement), or requiring no separator or terminator (as is the case with methods in a class).
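
The difference between a separator and a terminator shows up directly in how the repetition is parsed. The following Python sketch is illustrative only: it works on plain lists of token strings, and the names parse_separated and parse_terminated are made up for this example.

```python
def parse_separated(tokens, sep=","):
    """items ::= item {sep item}   (separator BETWEEN elements)."""
    if not tokens:
        raise SyntaxError("expected at least one item")
    items = [tokens[0]]
    pos = 1
    while pos < len(tokens):
        if tokens[pos] != sep:
            raise SyntaxError(f"expected {sep!r}, got {tokens[pos]!r}")
        if pos + 1 >= len(tokens):
            raise SyntaxError(f"trailing {sep!r} without an item")
        items.append(tokens[pos + 1])
        pos += 2
    return items


def parse_terminated(tokens, term=";"):
    """items ::= {item term}   (terminator AFTER every element)."""
    items = []
    pos = 0
    while pos < len(tokens):
        if pos + 1 >= len(tokens) or tokens[pos + 1] != term:
            raise SyntaxError(f"missing {term!r} after {tokens[pos]!r}")
        items.append(tokens[pos])
        pos += 2
    return items


if __name__ == "__main__":
    print(parse_separated(["a", ",", "b"]))        # argument-list style
    print(parse_terminated(["x", ";", "y", ";"]))  # statement style
```

A trailing separator is an error in the first form, while the second form requires the terminator after the last element too; deciding which one you want is part of specifying the grammar.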

If your language uses arithmetic expressions, it would be easy for you to copy an existing language's precedence rules. All else being equal, it's better to stick to something well-known, like C's expression rules, than to go for something exotic.

In addition to precedence issues (what gets parsed with each other) and repetition issues (how many of each element should occur, how are they separated?), you will also need to think about order: Must something always appear before another thing? If one thing is included, should another be excluded?

At this point, you may be tempted to enforce some rules in the grammar itself, for example that if a Person's age is specified, their birthdate may not be specified as well. While you can construct your grammar to do so, you may find it easier to enforce this with a "semantic check" pass after everything is parsed. This keeps the grammar simpler and, in my opinion, makes for better error messages when the rule is violated.
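
As a rough illustration of such a semantic check, assume the parser has already turned a Person into a dictionary of the fields that were present. The dictionary shape and the name check_person are assumptions for this sketch, not something the grammar prescribes.

```python
def check_person(person):
    """Return a list of semantic errors for one parsed Person.

    `person` is assumed to be a dict holding only the fields that
    appeared in the source, e.g. {"name": "Ann", "age": 30}.
    """
    errors = []
    if "age" in person and "birthdate" in person:
        errors.append(
            "Person may specify either 'age' or 'birthdate', not both"
        )
    return errors


if __name__ == "__main__":
    print(check_person({"name": "Ann", "age": 30}))
    print(check_person({"name": "Bob", "age": 30, "birthdate": "1990-01-01"}))
```

The grammar can then accept the fields in any combination, and the message names the exact rule that was broken instead of reporting a generic syntax error.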

How to Get Lookahead Symbol When Constructing LR(1) NFA for Parser

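A lookahead token is a terminal (a whole token, not a single character; it's a token after all) that may appear in the input right after the item's production has been recognized. It is always drawn from the FOLLOW set of the production's left-hand side, and the end-of-input (eof) token counts as one. To find it, look at what comes after the nonterminal being expanded: for an item [A -> α . B β, a], every item added for B gets as lookahead each terminal in FIRST(βa), that is, the terminals that can start β, plus a itself when β is empty or can derive the empty string. The initial item starts with the eof token as its lookahead, and all other lookaheads propagate from there as the items are unfolded, which is the bottom-up nature of LR parsing.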
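
The standard textbook way to compute these lookaheads is the LR(1) closure: for an item [A -> α . B β, a], add [B -> . γ, b] for every b in FIRST(βa). The Python sketch below follows that rule under simplifying assumptions: the example grammar (S -> C C, C -> c C | d) is chosen for illustration and is not from the question, it has no empty productions, and the item layout (head, body, dot, lookahead) is made up for this example.

```python
# Illustrative grammar, augmented with S' -> S. No empty productions.
GRAMMAR = {
    "S'": [("S",)],
    "S": [("C", "C")],
    "C": [("c", "C"), ("d",)],
}


def first_of_symbol(sym, seen=frozenset()):
    """FIRST set of one symbol, valid only without empty productions."""
    if sym not in GRAMMAR:          # terminal: FIRST is the symbol itself
        return {sym}
    if sym in seen:                 # guard against left recursion
        return set()
    result = set()
    for body in GRAMMAR[sym]:
        result |= first_of_symbol(body[0], seen | {sym})
    return result


def closure(items):
    """LR(1) closure of a set of items (head, body, dot, lookahead)."""
    result = set(items)
    work = list(items)
    while work:
        head, body, dot, look = work.pop()
        if dot == len(body) or body[dot] not in GRAMMAR:
            continue                # dot at the end or before a terminal
        rest = body[dot + 1:]
        # FIRST(rest + look): with no empty productions this is
        # FIRST(rest[0]) when rest is non-empty, otherwise {look}.
        lookaheads = first_of_symbol(rest[0]) if rest else {look}
        for prod in GRAMMAR[body[dot]]:
            for la in lookaheads:
                item = (body[dot], prod, 0, la)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return result


if __name__ == "__main__":
    start = {("S'", ("S",), 0, "$")}
    for item in sorted(closure(start)):
        print(item)
```

Starting from [S' -> . S, $], the item for S inherits $ because nothing follows S, while the items for C get c and d because the second C follows the first one in S -> C C.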
