Parsing Techniques – Name for This Type of Parser

parsing

Conventional parsers consume their entire input and produce a single parse tree. I'm looking for one that consumes a continuous stream and produces a parse forest [edit: see discussion in comments regarding why this use of that term may be unconventional]. My gut says that I can't be the first person to need (or think I need) such a parser, but I've searched off and on for months to no avail.

I recognize that I may be ensnared by the XY problem. My ultimate purpose is to parse a stream of text, ignoring most of it, and produce a stream of parse trees from the sections that are recognized.

So my question is conditional: if a class of parsers with these characteristics exists, what is it called? And if not, why not? What is the alternative? Perhaps I'm missing some way I can make conventional parsers do what I want.

Best Answer

A parser that returns a (partial) result before the whole input has been consumed is called an incremental parser. Incremental parsing can be difficult if there are local ambiguities in a grammar that are only decided later in the input. Another difficulty is feigning those parts of the parse tree that haven't been reached yet.

A parser that returns a forest of all possible parse trees – that is, returns a parse tree for each possible derivation of an ambiguous grammar – is called … I'm not sure if these things have a name yet. I know that the Marpa parser generator is capable of this, but any Earley or GLR based parser should be able to pull this off.

However, you don't seem to want any of that. You have a stream with multiple embedded documents, with garbage in between:

 garbagegarbage{key:42}garbagegarbage[1,2,3]{id:0}garbage...

You seem to want a parser that skips over the garbage, and (lazily) yields a sequence of ASTs for each document. This could be considered to be an incremental parser in its most general sense. But you'd actually implement a loop like this:

while stream is not empty:
  try:
    yield parse_document(stream at current position)
  except:
    advance position in stream by 1 character or token

The parse_docment function would then be a conventional, non-incremental parser. There is a minor difficulty of ensuring that you have read enough of the input stream for a successful parse. How this can be handled depends on the type of parser you are using. Possibilities include growing a buffer on certain parse errors, or using lazy tokenization.

Lazy tokenization is probably the most elegant solution due to your input stream. Instead of having a lexer phase produce a fixed list of tokens, the parser would lazily request the next token from a lexer callback^[1]. The lexer would then consume as much of the stream as needed. This way, the parser can only fail when the real end of the stream is reached, or when a real parse error occurred (i.e. we started parsing while still in garbage).

^{[1] a callback-driven lexer is a good idea in other contexts as well, because this can avoid some problems with longest-token matching.}

If you know what kind of documents you are searching for, you can optimize the skipping to stop only at promising locations. E.g. a JSON document always begins with the character { or [. Therefore, garbage is any string that does not contain these characters.

Best Answer

Related Solutions

Unit tests for a csv parser

C++ Parsing – Design a Parser for Handling Very Large Files

Related Topic