A parser that returns a (partial) result before the whole input has been consumed is called an incremental parser. Incremental parsing can be difficult if there are local ambiguities in a grammar that are only decided later in the input. Another difficulty is feigning those parts of the parse tree that haven't been reached yet.
A parser that returns a forest of all possible parse trees – that is, returns a parse tree for each possible derivation of an ambiguous grammar – is called … I'm not sure if these parsers have a standard name, though their output is commonly called a parse forest. I know that the Marpa parser generator is capable of this, but any Earley- or GLR-based parser should be able to pull this off.
However, you don't seem to want any of that. You have a stream with multiple embedded documents, with garbage in between:
garbagegarbage{key:42}garbagegarbage[1,2,3]{id:0}garbage...
You seem to want a parser that skips over the garbage, and (lazily) yields a sequence of ASTs for each document. This could be considered to be an incremental parser in its most general sense. But you'd actually implement a loop like this:
while stream is not empty:
    try:
        yield parse_document(stream at current position)
    except:
        advance position in stream by 1 character or token
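Assuming (as in your example) that the embedded documents are JSON, this loop can be sketched in Python with the standard library's json.JSONDecoder.raw_decode, which parses one document starting at a given index and reports where it ended; the name extract_documents is chosen here for illustration:

```python
import json

def extract_documents(text):
    """Lazily yield each JSON document embedded in a garbage-laden string."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        try:
            # Parse one document starting at pos; `end` is the index
            # just past that document.
            doc, end = decoder.raw_decode(text, pos)
            yield doc
            pos = end
        except json.JSONDecodeError:
            # Still inside garbage: skip one character and retry.
            pos += 1

stream = 'garbage{"key": 42}garbage[1,2,3]{"id": 0}garbage'
print(list(extract_documents(stream)))  # → [{'key': 42}, [1, 2, 3], {'id': 0}]
```

The JSONDecodeError raised by raw_decode plays exactly the role of the `except` branch in the pseudocode above.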
The parse_document function would then be a conventional, non-incremental parser. There is a minor difficulty in ensuring that you have read enough of the input stream for a successful parse. How this can be handled depends on the type of parser you are using. Possibilities include growing a buffer on certain parse errors, or using lazy tokenization.
Lazy tokenization is probably the most elegant solution, given that your input is a stream. Instead of having a lexer phase produce a fixed list of tokens, the parser would lazily request the next token from a lexer callback[1]. The lexer would then consume as much of the stream as needed. This way, the parser can only fail when the real end of the stream is reached, or when a genuine parse error occurs (i.e. we started parsing while still inside garbage).
[1] a callback-driven lexer is a good idea in other contexts as well, because this can avoid some problems with longest-token matching.
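As an illustration of such a pull-based lexer, here is a toy sketch (the token rules are invented for this example): the generator reads characters from the stream one at a time and never consumes more input than the current token needs:

```python
import io

def lazy_tokens(stream):
    """Yield tokens on demand, reading the stream one character at a time."""
    ch = stream.read(1)
    while ch:
        if ch in "{}[],:":
            # Single-character punctuation token.
            yield ch
            ch = stream.read(1)
        elif ch.isdigit():
            # Accumulate a number token; stops at the first non-digit,
            # which becomes the start of the next token.
            num = ch
            ch = stream.read(1)
            while ch and ch.isdigit():
                num += ch
                ch = stream.read(1)
            yield num
        else:
            # Skip anything else (whitespace, garbage) – simplified here.
            ch = stream.read(1)

print(list(lazy_tokens(io.StringIO("{1:[2,3]}junk"))))
```

Because tokens are produced on demand, the parser driving this generator only pulls input as far as its current rule requires.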
If you know what kind of documents you are searching for, you can optimize the skipping to stop only at promising locations. E.g. a JSON document always begins with the character { or [. Therefore, garbage is any string that does not contain these characters.
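For the JSON case, that optimized skip can be a one-liner with a regular expression; next_candidate is a hypothetical helper name:

```python
import re

# A JSON document can only begin with '{' or '['.
DOC_START = re.compile(r"[{\[]")

def next_candidate(text, pos):
    """Return the index of the next character that could begin a document."""
    m = DOC_START.search(text, pos)
    return m.start() if m else len(text)

print(next_candidate("garbagegarbage[1,2,3]", 0))  # → 14
```

Instead of advancing one character per failed parse, the outer loop would jump straight to this index.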
Best Answer
Large interaction data is usually serialised to a file. Write the data to a CSV file, or write it to a database and read it back from there. Journal the data once it crosses a certain size limit; saving it this way will keep your program from exhausting memory. Save it regularly after a span of time.
Also look up circular lists.
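The circular-list idea can be sketched with Python's collections.deque, whose maxlen parameter turns it into a fixed-size ring buffer that bounds memory use:

```python
from collections import deque

# A deque with maxlen acts as a circular buffer: once full, appending
# a new record silently discards the oldest one.
recent = deque(maxlen=3)
for event in ["a", "b", "c", "d", "e"]:
    recent.append(event)

print(list(recent))  # → ['c', 'd', 'e']
```

Older entries that fall out of the buffer would be the ones journaled to disk.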