Php – How to choose a proper parser generator for PHP

parsingPHPregular expressions

Some programmers avoid regexes in some situations (see this popular @nickf comment), perhaps using a parsing framework such as Lex/Yacc. Others prefer to stay within PHP, perhaps using regular expressions, as it avoids the need for another framework.

When must we use a "real" parser generator instead of coding a parser directly in PHP?

What is the best PHP tool-kit for parsing complex things and what factors can help me determine which is the best?

As I mentioned earlier, perhaps there is no best solution, but simply good practices to help select a solution.

NOTE

Read below only if you is a answer writer

My question was edited: now have good english and objectivity (!), thanks a lot for @FrustratedWithFormsDesigner, @gnat and @Matt. But, perhaps with @Matt editins, losted my point of view.

My point is not a "Regular expressions vs. framework" dichotomy (!).
I think (and see examples) that programming PHP, we have a good "tool kit", not only with regular expresions (see the power of preg_replace_callback), strings functions, etc., for simple or very specific parsing tasks; but also with XML Manipulation, for complex tasks! See de "parsing power" of process with DOM and/or XSLT…
I see also cases where I have a dilema about use this "PHP native tool kit", or install and learn about some external paser generator (plugged as a library, or a class, etc.).

Best Answer

Regular expressions vs. framework (in the sense of being separate from a language) is a false dichotomy. Counter-example: parser combinators. These can easily handle regular, context-free, and context-sensitive languages; nice libraries exist for them in many languages. Major advantages are that language integration allows them to snarf their host languages's features (such as type system, unit-testing framework, using classes/objects/functions to extend them, composability), and fewer limitations such as token lookahead; major disadvantages are that an implementation is tied to a specific language (duh), decent error messages are often difficult to generate, and they can easily be inefficient if care isn't taken.

Here are the factors that I look at when deciding what tool to use to parse something:

level of the language, and of the parsing tool, on the Chomsky hierarchy. Is it a regular language? Is it context-free? Is it context-sensitive? If it's context-sensitive, I'd better make sure my parsing tool can handle it.
testability of parsers, including those for sub-rules. It's pretty easy to build parsers that look right but are totally wrong. Being able to independently test sub-parsers makes it far, far easier to get the whole parser right. This goes hand-in-hand with composability: building big parsers by putting small parsers together.
whether the language is ambiguous. Some languages are ambiguous, and some tools can't deal with them. If there are multiple valid parses, I probably want to be notified of that, possibly by an error or multiple results, instead of just getting a single parse and thinking it's the only one.
control over parse results. What output do parsers produce -- a concrete parse tree? Are the nodes strongly typed or stringly typed? I find it very useful to have some control over how the parse tree is built, as it is being built; having to post-process the tree is often more painful and error-prone, in my experience. An example is an integer parser -- should it return an integer or a string?
efficiency. I don't know a whole lot about this, but there are definitely differences between different approaches, especially when there's a lot of backtracking.
error reporting. At the very least, parse errors should be reportable with position and current rule information; a trace can also be useful but I haven't really seen any decent solutions to this problem (although I bet there are some).
expressivity of the parsing tool. Does it make me jump through hoops to express simple things like a sequence of values separated by commas, with no trailing comma? I'm a big fan of BNF, but not being able to extend it is frustrating, and can make grammars look far more complicated, and thus harder to maintain, than they actually need to. Being able to capture patterns and abstract over them shouldn't be a luxury, but a requirement.

Option 1: parser generators, or 'you need to parse some language and you just want to get it working, dammit'

Say, you're asked to build a parser for some ancient data format NOW. Or you need your parser to be fast. Or you need your parser to be easily maintainable.

In these cases, you're probably best off using a parser generator. You don't have to fiddle around with the details, you don't have to get lots of complicated code to work properly, you just write out the grammar the input will adhere to, write some handling code and presto: instant parser.

The advantages are clear:

It's (usually) quite easy to write a specification, in particular if the input format isn't too weird (option 2 would be better if it is).
You end up with a very easily maintainable piece of work that is easily understood: a grammar definition usually flows a lot more natural than code.
The parsers generated by good parser generators are usually a lot faster than hand-written code. Hand-written code can be faster, but only if you know your stuff - this is why most widely used compilers use a hand-written recursive-descent parser.

There's one thing you have to be careful of with parser-generators: the can sometimes reject your grammars. For an overview of the different types of parsers and how they can bite you, you may want to start here. Here you can find an overview of a lot of implementations and the types of grammars they accept.

Option 2: hand-written parsers, or 'you want to build your own parser, and you care about being user-friendly'

Parser generators are nice, but they aren't very user (the end-user, not you) friendly. You typically can't give good error messages, nor can you provide error recovery. Perhaps your language is very weird and parsers reject your grammar or you need more control than the generator gives you.

In these cases, using a hand-written recursive-descent parser is probably the best. While getting it right may be complicated, you have complete control over your parser so you can do all kinds of nice stuff you can't do with parser generators, like error messages and even error recovery (try removing all the semicolons from a C# file: the C# compiler will complain, but will detect most other errors anyway regardless of the presence of semicolons).

Hand-written parsers also usually perform better than generated ones, assuming the quality of the parser is high enough. On the other hand, if you don't manage to write a good parser - usually due to (a combination of) lack of experience, knowledge or design - then performance is usually slower. For lexers the opposite is true though: generally generated lexers use table lookups, making them faster than (most) hand-written ones.

Education-wise, writing your own parser will teach you more than using a generator. You have to write more and more complicated code after all, plus you have to understand exactly how you parse a language. On the other hand, if you want to learn how to create your own language (so, get experience at language design), either option 1 or option 3 is preferable: if you're developing a language, it will probably change a lot, and option 1 and 3 give you an easier time with that.

Option 3: hand written parser generators, or 'you're trying to learn a lot from this project and you wouldn't mind ending up with a nifty piece of code you can re-use a lot'

This is the path I'm currently walking down: you write your own parser generator. While highly nontrivial, doing this will probably teach you the most.

To give you an idea what doing a project like this involves I'll tell you about my own progress.

The lexer generator

I created my own lexer generator first. I usually design software starting with how the code will be used, so I thought about how I wanted to be able to use my code and wrote this piece of code (it's in C#):

Lexer<CalculatorToken> calculatorLexer = new Lexer<CalculatorToken>(
    new List<StringTokenPair>()
    { // This is just like a lex specification:
      //                    regex   token
        new StringTokenPair("\\+",  CalculatorToken.Plus),
        new StringTokenPair("\\*",  CalculatorToken.Times),
        new StringTokenPair("(",    CalculatorToken.LeftParenthesis),
        new StringTokenPair(")",    CalculatorToken.RightParenthesis),
        new StringTokenPair("\\d+", CalculatorToken.Number),
    });

foreach (CalculatorToken token in
             calculatorLexer.GetLexer(new StringReader("15+4*10")))
{ // This will iterate over all tokens in the string.
    Console.WriteLine(token.Value);
}

// Prints:
// 15
// +
// 4
// *
// 10

The input string-token pairs are converted into a corresponding recursive structure describing the regular expressions they represent using the ideas of an arithmetic stack. This is then converted into a NFA (nondeterministic finite automaton), which is in turn converted into a DFA (deterministic finite automaton). You can then match strings against the DFA.

This way, you get a good idea how exactly lexers work. In addition, if you do it the right way the results from your lexer generator can be roughly as fast as professional implementations. You also don't lose any expressiveness compared to option 2, and not much expressiveness compared to option 1.

I implemented my lexer generator in just over 1600 lines of code. This code makes the above work, but it still generates the lexer on the fly every time you start the program: I'm going to add code to write it to disk at some point.

If you want to know how to write your own lexer, this is a good place to start.

The parser generator

You then write your parser generator. I refer to here again for an overview on the different kinds of parsers - as a rule of thumb, the more they can parse, the slower they are.

Speed not being an issue for me, I chose to implement an Earley parser. Advanced implementations of an Earley parser have been shown to be about twice as slow as other parser types.

In return for that speed hit, you get the ability to parse any kind of grammar, even ambiguous ones. This means you never need to worry about whether your parser has any left-recursion in it, or what a shift-reduce conflict is. You can also define grammars more easily using ambiguous grammars if it doesn't matter which parse tree is the result, such as that it doesn't matter whether you parse 1+2+3 as (1+2)+3 or as 1+(2+3).

This is what a piece of code using my parser generator can look like:

Lexer<CalculatorToken> calculatorLexer = new Lexer<CalculatorToken>(
    new List<StringTokenPair>()
    {
        new StringTokenPair("\\+",  CalculatorToken.Plus),
        new StringTokenPair("\\*",  CalculatorToken.Times),
        new StringTokenPair("(",    CalculatorToken.LeftParenthesis),
        new StringTokenPair(")",    CalculatorToken.RightParenthesis),
        new StringTokenPair("\\d+", CalculatorToken.Number),
    });

Grammar<IntWrapper, CalculatorToken> calculator
    = new Grammar<IntWrapper, CalculatorToken>(calculatorLexer);

// Declaring the nonterminals.
INonTerminal<IntWrapper> expr = calculator.AddNonTerminal<IntWrapper>();
INonTerminal<IntWrapper> term = calculator.AddNonTerminal<IntWrapper>();
INonTerminal<IntWrapper> factor = calculator.AddNonTerminal<IntWrapper>();

// expr will be our head nonterminal.
calculator.SetAsMainNonTerminal(expr);

// expr: term | expr Plus term;
calculator.AddProduction(expr, term.GetDefault());
calculator.AddProduction(expr,
                         expr.GetDefault(),
                         CalculatorToken.Plus.GetDefault(),
                         term.AddCode(
                         (x, r) => { x.Result.Value += r.Value; return x; }
                         ));

// term: factor | term Times factor;
calculator.AddProduction(term, factor.GetDefault());
calculator.AddProduction(term,
                         term.GetDefault(),
                         CalculatorToken.Times.GetDefault(),
                         factor.AddCode
                         (
                         (x, r) => { x.Result.Value *= r.Value; return x; }
                         ));

// factor: LeftParenthesis expr RightParenthesis
//         | Number;
calculator.AddProduction(factor,
                         CalculatorToken.LeftParenthesis.GetDefault(),
                         expr.GetDefault(),
                         CalculatorToken.RightParenthesis.GetDefault());
calculator.AddProduction(factor,
                         CalculatorToken.Number.AddCode
                         (
                         (x, s) => { x.Result = new IntWrapper(int.Parse(s));
                                     return x; }
                         ));

IntWrapper result = calculator.Parse("15+4*10");
// result == 55

(Note that IntWrapper is simply an Int32, except that C# requires it to be a class, hence I had to introduce a wrapper class)

I hope you see that the code above is very powerful: any grammar you can come up with can be parsed. You can add arbitrary bits of code in the grammar capable of performing lots of tasks. If you manage to get this all working, you can re-use the resulting code to do a lot of tasks very easily: just imagine building a command-line interpreter using this piece of code.

Unit tests for a csv parser

I just found https://github.com/maxogden/csv-spectrum:

A bunch of different CSV files to serve as an acid test for CSV parsing libraries. There are also JSON versions of the CSVs for verification purposes.

The goal of this repository is to capture test cases to represent the entire CSV spectrum.

NOTE

Best Answer

Related Solutions

Language Design – Parser Generator vs Custom Lexer and Parser

Option 1: parser generators, or 'you need to parse some language and you just want to get it working, dammit'

Option 2: hand-written parsers, or 'you want to build your own parser, and you care about being user-friendly'

Option 3: hand written parser generators, or 'you're trying to learn a lot from this project and you wouldn't mind ending up with a nifty piece of code you can re-use a lot'

Unit tests for a csv parser

Related Topic