Regex – What are the differences between PEGs and CFGs

context-free-grammarlanguage-agnosticparsingpegregex

From this wikipedia page:

The fundamental difference between
context-free grammars and parsing
expression grammars is that the PEG's
choice operator is ordered. If the
first alternative succeeds, the second
alternative is ignored. Thus ordered
choice is not commutative, unlike
unordered choice as in context-free
grammars and regular expressions.
Ordered choice is analogous to soft
cut operators available in some logic
programming languages.

Why does PEG's choice operator short circuits the matching? Is it because to minimize memory usage (due to memoization)?

I'm not sure what the choice operator is in regular expressions but let's suppose it is this: /[aeiou]/ to match a vowel. So this regex is commutative because I could have written it in any of the 5! (five factorial) permutations of the vowel characters? i.e. /[aeiou]/ behaves the same as /[eiaou]/. What is the advantage of it being commutative? (c.f. PEG's non-commutativity)

The consequence is that if a CFG is
transliterated directly to a PEG, any
ambiguity in the former is resolved by
deterministically picking one parse
tree from the possible parses. By
carefully choosing the order in which
the grammar alternatives are
specified, a programmer has a great
deal of control over which parse tree
is selected.

Is this saying that PEG's grammar is superior to CFG's?

Best Answer

A CFG grammar is non-deterministic, meaning that some input could result in two or more possible parse-trees. Though most CFG-based parser-generators have restrictions on the determinability of the grammar. It will give a warning or error if it has two or more choices.

A PEG grammar is deterministic, meaning that any input can only be parsed one way.

To take a classic example; The grammar

if_statement := "if" "(" expr ")" statement "else" statement
              | "if" "(" expr ")" statement;

applied to the input

if (x1) if (x2) y1 else y2

could either be parsed as

if_statement(x1, if_statement(x2, y1, y2))

or

if_statement(x1, if_statement(x2, y1), y2)

A CFG-parser would generate a Shift/Reduce-conflict, since it can't decide if it should shift (read another token), or reduce (complete the node), when reaching the "else" keyword. Of course, there are ways to get around this problem.

A PEG-parser would always pick the first choice.

Which one is better is for you to decide. My opinion is that often PEG-grammars is easier to write, and CFG grammars easier to analyze.