Php – How to choose a proper parser generator for PHP

parsingPHPregular expressions

Some programmers avoid regexes in some situations (see this popular @nickf comment), perhaps using a parsing framework such as Lex/Yacc. Others prefer to stay within PHP, perhaps using regular expressions, as it avoids the need for another framework.

When must we use a "real" parser generator instead of coding a parser directly in PHP?

What is the best PHP tool-kit for parsing complex things and what factors can help me determine which is the best?

As I mentioned earlier, perhaps there is no best solution, but simply good practices to help select a solution.


NOTE

Read below only if you is a answer writer

My question was edited: now have good english and objectivity (!), thanks a lot for @FrustratedWithFormsDesigner, @gnat and @Matt. But, perhaps with @Matt editins, losted my point of view.

  • My point is not a "Regular expressions vs. framework" dichotomy (!).

  • I think (and see examples) that programming PHP, we have a good "tool kit", not only with regular expresions (see the power of preg_replace_callback), strings functions, etc., for simple or very specific parsing tasks; but also with XML Manipulation, for complex tasks! See de "parsing power" of process with DOM and/or XSLT

  • I see also cases where I have a dilema about use this "PHP native tool kit", or install and learn about some external paser generator (plugged as a library, or a class, etc.).

Best Answer

Regular expressions vs. framework (in the sense of being separate from a language) is a false dichotomy. Counter-example: parser combinators. These can easily handle regular, context-free, and context-sensitive languages; nice libraries exist for them in many languages. Major advantages are that language integration allows them to snarf their host languages's features (such as type system, unit-testing framework, using classes/objects/functions to extend them, composability), and fewer limitations such as token lookahead; major disadvantages are that an implementation is tied to a specific language (duh), decent error messages are often difficult to generate, and they can easily be inefficient if care isn't taken.

Here are the factors that I look at when deciding what tool to use to parse something:

  • level of the language, and of the parsing tool, on the Chomsky hierarchy. Is it a regular language? Is it context-free? Is it context-sensitive? If it's context-sensitive, I'd better make sure my parsing tool can handle it.

  • testability of parsers, including those for sub-rules. It's pretty easy to build parsers that look right but are totally wrong. Being able to independently test sub-parsers makes it far, far easier to get the whole parser right. This goes hand-in-hand with composability: building big parsers by putting small parsers together.

  • whether the language is ambiguous. Some languages are ambiguous, and some tools can't deal with them. If there are multiple valid parses, I probably want to be notified of that, possibly by an error or multiple results, instead of just getting a single parse and thinking it's the only one.

  • control over parse results. What output do parsers produce -- a concrete parse tree? Are the nodes strongly typed or stringly typed? I find it very useful to have some control over how the parse tree is built, as it is being built; having to post-process the tree is often more painful and error-prone, in my experience. An example is an integer parser -- should it return an integer or a string?

  • efficiency. I don't know a whole lot about this, but there are definitely differences between different approaches, especially when there's a lot of backtracking.

  • error reporting. At the very least, parse errors should be reportable with position and current rule information; a trace can also be useful but I haven't really seen any decent solutions to this problem (although I bet there are some).

  • expressivity of the parsing tool. Does it make me jump through hoops to express simple things like a sequence of values separated by commas, with no trailing comma? I'm a big fan of BNF, but not being able to extend it is frustrating, and can make grammars look far more complicated, and thus harder to maintain, than they actually need to. Being able to capture patterns and abstract over them shouldn't be a luxury, but a requirement.