Recursive Descent vs. Generated Parsers – Efficiency

compiler-construction recursive-descent

How do hand-written recursive descent parsers (which are inevitably LL(k)) compare to generated LALR parsers in terms of performance?

I know that LALR parsers are able to handle far more grammars than LL(k); however, it's my intention to write my parser by hand, and recursive descent seems the most appropriate choice. Out of interest, is it possible to write any other kind by hand (reasonably readably)?

N.B. I am using a functional language with tail-call optimisation (F#), so [well-tailored] recursion won't be as much of an issue as in other languages.

Best Answer

I think a lot depends on the language you are trying to parse. Another part of performance which sometimes gets forgotten is the lexical analysis (scanning) part - it is significant for performance as it deals with characters rather than symbols. Recursive descent is a good first iteration at writing a parser, and it makes following the parsed language's logic quite natural. I think that if the parsed language fits (no left recursion) you should start with recursive descent. Choosing LALR for performance at this stage seems to be premature optimization. You can write a chart parser by hand, but I doubt this is what you mean. Writing an LALR parser by hand is possible but tedious.
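To make the point about scanning concrete, here is a minimal sketch of a separate lexing pass, so that the parser proper works on tokens rather than characters. The Token type and the tokenize function are made up for illustration, and treating every unrecognised character as an operator is a simplifying assumption, not something a real lexer should do.

```fsharp
// Hypothetical token type for a small arithmetic language.
type Token =
    | Num of int
    | Op of char
    | LParen
    | RParen

// Turn a string into a token list in a single left-to-right pass.
let tokenize (s: string) : Token list =
    let rec go i acc =
        if i >= s.Length then List.rev acc
        elif System.Char.IsWhiteSpace(s.[i]) then go (i + 1) acc
        elif System.Char.IsDigit(s.[i]) then
            // Consume a whole run of digits as one number token.
            let mutable j = i
            while j < s.Length && System.Char.IsDigit(s.[j]) do
                j <- j + 1
            go j (Num(int (s.Substring(i, j - i))) :: acc)
        elif s.[i] = '(' then go (i + 1) (LParen :: acc)
        elif s.[i] = ')' then go (i + 1) (RParen :: acc)
        else
            // Assumption: anything else is an operator character.
            go (i + 1) (Op(s.[i]) :: acc)
    go 0 []

let tokens = tokenize "12 + 3*(4)"
```

The parser then pattern matches on Token values, and all the per-character work stays in one tight loop.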

Related Solutions

Difference between an LL and Recursive Descent parser

LL is usually a more efficient parsing technique than recursive-descent. In fact, a naive recursive-descent parser will actually be O(k^n) (where n is the input size) in the worst case. Some techniques such as memoization (which yields a Packrat parser) can improve this as well as extend the class of grammars accepted by the parser, but there is always a space tradeoff. LL parsers are (to my knowledge) always linear time.
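As a rough illustration of the blow-up and of the memoization fix, the sketch below counts how often one rule is re-parsed with and without a cache keyed on input position. The grammar, the parseCounting function and its useMemo flag are all invented for this example, and only the atom rule is memoized, so this is the packrat idea in miniature rather than a full packrat parser.

```fsharp
open System.Collections.Generic

// Grammar (hypothetical, chosen to force backtracking):
//   expr := atom '+' expr | atom '-' expr | atom
//   atom := digit | '(' expr ')'
// Returns (did the whole input parse?, number of times atom's body ran).
let parseCounting (useMemo: bool) (input: string) : bool * int =
    let cache = Dictionary<int, int option>()
    let calls = ref 0
    let peek pos c = pos < input.Length && input.[pos] = c
    let rec expr (pos: int) : int option =
        // Each alternative re-parses atom from the same position.
        let binary op =
            match atom pos with
            | Some p when peek p op -> expr (p + 1)
            | _ -> None
        match binary '+' with
        | Some p -> Some p
        | None ->
            match binary '-' with
            | Some p -> Some p
            | None -> atom pos
    and atom (pos: int) : int option =
        let compute () =
            calls.Value <- calls.Value + 1
            if pos < input.Length && System.Char.IsDigit(input.[pos]) then Some(pos + 1)
            elif peek pos '(' then
                match expr (pos + 1) with
                | Some p when peek p ')' -> Some(p + 1)
                | _ -> None
            else None
        if not useMemo then compute ()
        else
            // Packrat-style cache: one stored result per input position.
            match cache.TryGetValue(pos) with
            | true, r -> r
            | _ ->
                let r = compute ()
                cache.[pos] <- r
                r
    (expr 0 = Some(input.Length), calls.Value)

let naive = parseCounting false "((((1))))"
let memo = parseCounting true "((((1))))"
```

Without the cache, the work for this grammar roughly triples with each extra level of parentheses; with it, atom's body runs once per position, at the cost of the table.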

On the flip side, you are correct in your intuition that recursive-descent parsers can handle a greater class of grammars than LL. Recursive-descent can handle any grammar which is LL(*) (that is, unlimited lookahead) as well as a small set of ambiguous grammars. This is because recursive-descent is actually a directly-encoded implementation of PEGs, or Parsing Expression Grammars. Specifically, the disjunctive operator (a | b) is not commutative, meaning that a | b does not equal b | a. A recursive-descent parser will try each alternative in order, so if a matches the input, it will succeed even if b would also have matched. This allows classic "longest match" ambiguities like the dangling else problem to be handled simply by ordering disjunctions correctly.
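The non-commutativity of ordered choice is easy to see in a few lines. The Parser type and the lit and orElse helpers below are hypothetical names for this sketch, not part of any library.

```fsharp
// A parser consumes a prefix of the input and returns a value plus the rest.
type Parser<'a> = char list -> ('a * char list) option

// Match a literal string at the start of the input.
let lit (s: string) : Parser<string> =
    fun input ->
        let rec strip expected rest =
            match expected, rest with
            | [], _ -> Some(s, rest)
            | e :: es, c :: cs when e = c -> strip es cs
            | _ -> None
        strip (List.ofSeq s) input

// Ordered choice: try p first and only fall back to q if p fails.
let orElse (p: Parser<'a>) (q: Parser<'a>) : Parser<'a> =
    fun input ->
        match p input with
        | Some r -> Some r
        | None -> q input

let input = List.ofSeq "ifx"
let shortFirst = orElse (lit "if") (lit "ifx") input   // commits to "if"
let longFirst = orElse (lit "ifx") (lit "if") input    // commits to "ifx"
```

Both alternatives match the input, but the result depends on which one is listed first, which is exactly why ordering the longer alternative first resolves "longest match" cases.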

With all of that said, it is possible to implement an LL(k) parser using recursive-descent so that it runs in linear time. This is done by essentially inlining the predict sets so that each parse routine determines the appropriate production for a given input in constant time. Unfortunately, such a technique eliminates an entire class of grammars from being handled. Once we get into predictive parsing, problems like dangling else are no longer solvable with such ease.
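A sketch of what "inlining the predict sets" can look like for LL(1): the routine inspects one token and commits to a production without ever backtracking. The StmtToken and Production types and the predictStatement function are invented for illustration.

```fsharp
// Hypothetical tokens and productions for a tiny statement grammar.
type StmtToken =
    | If
    | While
    | Ident of string
    | LBrace
    | Other

type Production =
    | IfStmt
    | WhileStmt
    | AssignStmt
    | Block

// The predict sets are inlined as a single match on the next token,
// so choosing a production takes constant time and never backtracks.
let predictStatement (tokens: StmtToken list) : Production option =
    match tokens with
    | If :: _ -> Some IfStmt
    | While :: _ -> Some WhileStmt
    | Ident _ :: _ -> Some AssignStmt
    | LBrace :: _ -> Some Block
    | _ -> None
```

Because the decision is made from the lookahead alone, two productions that can start with the same token cannot be told apart this way, which is the class of grammars that gets lost.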

As for why LL would be chosen over recursive-descent, it's mainly a question of efficiency and maintainability. Recursive-descent parsers are markedly easier to implement, but they're usually harder to maintain since the grammar they represent does not exist in any declarative form. Most non-trivial parser use-cases employ a parser generator such as ANTLR or Bison. With such tools, it really doesn't matter if the algorithm is directly-encoded recursive-descent or table-driven LL(k).

As a matter of interest, it is also worth looking into recursive-ascent, which is a parsing algorithm directly encoded after the fashion of recursive-descent, but capable of handling any LALR grammar. I would also dig into parser combinators, which are a functional way of composing recursive-descent parsers together.
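For a flavour of parser combinators, here is a minimal sketch in which small parsers are ordinary functions and larger ones are built by composing them. All names here (Parser, satisfy, map, andThen, many, digit, number) are defined locally for the example and are not taken from a real combinator library such as FParsec.

```fsharp
type Parser<'a> = char list -> ('a * char list) option

// Accept one character that satisfies a predicate.
let satisfy (pred: char -> bool) : Parser<char> =
    fun input ->
        match input with
        | c :: rest when pred c -> Some(c, rest)
        | _ -> None

// Transform the result of a parser.
let map (f: 'a -> 'b) (p: Parser<'a>) : Parser<'b> =
    fun input ->
        match p input with
        | Some(v, rest) -> Some(f v, rest)
        | None -> None

// Run p, then q on whatever p left over.
let andThen (p: Parser<'a>) (q: Parser<'b>) : Parser<'a * 'b> =
    fun input ->
        match p input with
        | Some(a, rest) ->
            match q rest with
            | Some(b, rest2) -> Some((a, b), rest2)
            | None -> None
        | None -> None

// Apply p zero or more times, collecting the results in order.
let many (p: Parser<'a>) : Parser<'a list> =
    let rec loop acc input =
        match p input with
        | Some(v, rest) -> loop (v :: acc) rest
        | None -> Some(List.rev acc, input)
    loop []

let digit : Parser<int> =
    satisfy (fun c -> '0' <= c && c <= '9') |> map (fun c -> int c - int '0')

// One digit followed by any number of further digits, folded into an int.
let number : Parser<int> =
    andThen digit (many digit)
    |> map (fun (d, ds) -> List.fold (fun acc x -> acc * 10 + x) d ds)

let parsed = number (List.ofSeq "123+4")
```

Each combinator is still a recursive-descent routine underneath; the gain is that the grammar is visible in how the pieces are glued together.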

Recursive descent parser and functional programming

Answer derived from this blog article:

So my question is what would a more traditional functional approach to parsing (i.e. few side effects) look like?

Sounds like you need to separate functional (as in Lisp, Scheme, Standard ML, CAML, OCaml, F#) from purity (absence of side effects, as in Haskell) and incidental language features (algebraic datatypes, pattern matching).

Thanks to algebraic datatypes, pattern matching and higher-order functions, F# is good for parsing and great for transformations and code generation, but most production parsers written in F# are not pure. Historically, the family of languages that F# is mostly derived from (the MetaLanguages, or MLs) was bred specifically for this kind of metaprogramming.

Here is a very simple set of mutually-recursive active patterns that parse and evaluate mathematical expressions composed of single digits, + - * operators and bracketed subexpressions:

> let rec (|Term|_|) = function
    | Factor(e1, t) ->
        let rec aux e1 = function
          | '+'::Factor(e2, t) -> aux (e1 + e2) t
          | '-'::Factor(e2, t) -> aux (e1 - e2) t
          | t -> Some(e1, t)
        aux e1 t
    | _ -> None
  and (|Factor|_|) = function
    | '-'::Factor(e, t) -> Some(-e, t)
    | Atom(e1, '*'::Factor(e2, t)) -> Some(e1 * e2, t)
    | Atom(e, t) -> Some(e, t)
    | _ -> None
  and (|Atom|_|) = function
    | c::t when '0'<=c && c<='9' -> Some(int(string c), t)
    | '('::Term(e, ')'::t) -> Some(e, t)
    | _ -> None;;
val ( |Term|_| ) : char list -> (int * char list) option
val ( |Factor|_| ) : char list -> (int * char list) option
val ( |Atom|_| ) : char list -> (int * char list) option

Here is an example of it being used to parse and evaluate an expression:

> let (Term e) = List.ofSeq "1+2*(3-4)*-5";;
val e : int * char list = (11, [])

That's a pure solution that's using pattern matching over lists with F#'s active patterns. In reality, you'll want to define a type for your abstract syntax tree and return a value of that type. This is really easy in F#:

type expr =
  | Int of int
  | Neg of expr
  | Add of expr * expr
  | Sub of expr * expr
  | Mul of expr * expr

  static member (~-) f = Neg f
  static member (+) (f, g) = Add(f, g)
  static member (-) (f, g) = Sub(f, g)
  static member (*) (f, g) = Mul(f, g)

let rec (|Term|_|) = function
  | Factor(e1, t) ->
      let rec aux e1 = function
        | '+'::Factor(e2, t) -> aux (e1 + e2) t
        | '-'::Factor(e2, t) -> aux (e1 - e2) t
        | t -> Some(e1, t)
      aux e1 t
  | _ -> None
and (|Factor|_|) = function
  | '-'::Factor(e, t) -> Some(-e, t)
  | Atom(e1, '*'::Factor(e2, t)) -> Some(e1 * e2, t)
  | Atom(e, t) -> Some(e, t)
  | _ -> None
and (|Atom|_|) = function
  | c::t when '0'<=c && c<='9' -> Some(Int(int(string c)), t)
  | '('::Term(e, ')'::t) -> Some(e, t)
  | _ -> None

let (Term e) = List.ofSeq "1+2*(3-4)*-5"

Note that only one minor change to the parser was required because the AST can also be constructed using the +, - and * operators.
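Once the parser returns a tree instead of a number, evaluation becomes a separate pass. The eval function below is an addition for illustration and is not part of the original answer; the expr type is repeated here without the operator overloads so the snippet stands on its own, and the sample tree is written out by hand to mirror "1+2*(3-4)*-5".

```fsharp
type expr =
    | Int of int
    | Neg of expr
    | Add of expr * expr
    | Sub of expr * expr
    | Mul of expr * expr

// Walk the tree bottom-up and compute its value.
let rec eval (e: expr) : int =
    match e with
    | Int n -> n
    | Neg f -> -(eval f)
    | Add(f, g) -> eval f + eval g
    | Sub(f, g) -> eval f - eval g
    | Mul(f, g) -> eval f * eval g

// Hand-built tree corresponding to 1+2*(3-4)*-5.
let tree = Add(Int 1, Mul(Int 2, Mul(Sub(Int 3, Int 4), Neg(Int 5))))
let value = eval tree
```

Here value comes out as 11, the same result the evaluating parser produced directly.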

Second, is it worthwhile to try and adopt a functional approach to parsing, or is it really on optimizations to intermediate code that functional languages shine and I just haven't gotten there yet?

You're talking about purity, not functional programming. Purity is not particularly useful in the context of parsing text and, in fact, can be a real hindrance (e.g. interning symbols is a nightmare in Haskell). However, F# has many other benefits that make it good for this set of problems. In particular, although other languages like OCaml have much better tools for parsing, I think F# is the best .NET language in this context.

That is, should I fuddle through the parsing in F# using an imperative style and switch to a more functional approach later on?

Depends entirely upon what you want to make functional. I'd use fslex and fsyacc with pure code to construct ASTs in the actions but impurities for anything like hash consing or generating unique IDs.
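As an example of the kind of impurity meant here, this sketch interns symbol names into unique integer IDs using a mutable dictionary. The symbolTable and intern names are hypothetical, and a real compiler would likely wrap this in its own type rather than use a global table.

```fsharp
open System.Collections.Generic

// Mutable table shared by all callers: deliberately impure.
let symbolTable = Dictionary<string, int>()

// Return the existing ID for a name, or allocate the next unused one.
let intern (name: string) : int =
    match symbolTable.TryGetValue(name) with
    | true, sym -> sym
    | _ ->
        let sym = symbolTable.Count
        symbolTable.[name] <- sym
        sym

let a = intern "x"
let b = intern "y"
let c = intern "x"   // same name, so the same ID as a
```

The AST-building actions can stay pure while a helper like this hides the side effect behind a small function.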

You may appreciate the following articles I have written on this subject at this blog (note paywall):

"Parsing text with Lex and Yacc" (30th September 2007).
"Optimizing a simple bytecode interpreter" (31st October 2007).
"Parser combinators" (30th November 2007).
"Language-oriented programming: The Term-level Interpreter" (31st December 2007).
"Language-oriented programming: Term Rewriting" (16th August 2008).
"Run-time code generation using System.Reflection.Emit" (31st August 2008).
"Parsing and visualizing binary Geographic Information System data" (30th November 2009).
