A parser that returns a (partial) result before the whole input has been consumed is called an incremental parser. Incremental parsing can be difficult if there are local ambiguities in a grammar that are only decided later in the input. Another difficulty is feigning those parts of the parse tree that haven't been reached yet.
A parser that returns a forest of all possible parse trees – that is, returns a parse tree for each possible derivation of an ambiguous grammar – is called … I'm not sure if these parsers have a standard name, though their output is commonly called a parse forest. I know that the Marpa parser generator is capable of this, but any Earley- or GLR-based parser should be able to pull this off.
However, you don't seem to want any of that. You have a stream with multiple embedded documents, with garbage in between:
garbagegarbage{key:42}garbagegarbage[1,2,3]{id:0}garbage...
You seem to want a parser that skips over the garbage, and (lazily) yields a sequence of ASTs for each document. This could be considered to be an incremental parser in its most general sense. But you'd actually implement a loop like this:
while stream is not empty:
    try:
        yield parse_document(stream at current position)
    except:
        advance position in stream by 1 character or token
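Assuming (as in your example) that the embedded documents are JSON, this loop can be sketched in Python with the standard library's json.JSONDecoder.raw_decode, which parses one document starting at a given index and reports where it ended; the name extract_documents is chosen here for illustration:

```python
import json

def extract_documents(text):
    """Lazily yield each JSON document embedded in a garbage-laden string."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        try:
            # Parse one document starting at pos; `end` is the index
            # just past that document.
            doc, end = decoder.raw_decode(text, pos)
            yield doc
            pos = end
        except json.JSONDecodeError:
            # Still inside garbage: skip one character and retry.
            pos += 1

stream = 'garbage{"key": 42}garbage[1,2,3]{"id": 0}garbage'
print(list(extract_documents(stream)))  # → [{'key': 42}, [1, 2, 3], {'id': 0}]
```

The JSONDecodeError raised by raw_decode plays exactly the role of the `except` branch in the pseudocode above.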
The parse_document function would then be a conventional, non-incremental parser. There is a minor difficulty in ensuring that you have read enough of the input stream for a successful parse. How this can be handled depends on the type of parser you are using. Possibilities include growing a buffer on certain parse errors, or using lazy tokenization.
Lazy tokenization is probably the most elegant solution, given that your input is a stream. Instead of having a lexer phase produce a fixed list of tokens, the parser would lazily request the next token from a lexer callback[1]. The lexer would then consume as much of the stream as needed. This way, the parser can only fail when the real end of the stream is reached, or when a genuine parse error occurs (i.e. we started parsing while still inside garbage).
[1] a callback-driven lexer is a good idea in other contexts as well, because this can avoid some problems with longest-token matching.
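As an illustration of such a pull-based lexer, here is a toy sketch (the token rules are invented for this example): the generator reads characters from the stream one at a time and never consumes more input than the current token needs:

```python
import io

def lazy_tokens(stream):
    """Yield tokens on demand, reading the stream one character at a time."""
    ch = stream.read(1)
    while ch:
        if ch in "{}[],:":
            # Single-character punctuation token.
            yield ch
            ch = stream.read(1)
        elif ch.isdigit():
            # Accumulate a number token; stops at the first non-digit,
            # which becomes the start of the next token.
            num = ch
            ch = stream.read(1)
            while ch and ch.isdigit():
                num += ch
                ch = stream.read(1)
            yield num
        else:
            # Skip anything else (whitespace, garbage) – simplified here.
            ch = stream.read(1)

print(list(lazy_tokens(io.StringIO("{1:[2,3]}junk"))))
```

Because tokens are produced on demand, the parser driving this generator only pulls input as far as its current rule requires.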
If you know what kind of documents you are searching for, you can optimize the skipping to stop only at promising locations. E.g. a JSON document always begins with the character { or [. Therefore, garbage is any string that does not contain these characters.
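For the JSON case, that optimized skip can be a one-liner with a regular expression; next_candidate is a hypothetical helper name:

```python
import re

# A JSON document can only begin with '{' or '['.
DOC_START = re.compile(r"[{\[]")

def next_candidate(text, pos):
    """Return the index of the next character that could begin a document."""
    m = DOC_START.search(text, pos)
    return m.start() if m else len(text)

print(next_candidate("garbagegarbage[1,2,3]", 0))  # → 14
```

Instead of advancing one character per failed parse, the outer loop would jump straight to this index.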
Best Answer
Large interaction data is usually serialised to a file. Write the data to a CSV file, or write it to a database and read it back from there. Journal the data once it crosses a certain size limit; saving it this way will keep your program from exhausting memory. Save it regularly after a span of time.
Also look up circular lists.
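The circular-list idea can be sketched with Python's collections.deque, whose maxlen parameter turns it into a fixed-size ring buffer that bounds memory use:

```python
from collections import deque

# A deque with maxlen acts as a circular buffer: once full, appending
# a new record silently discards the oldest one.
recent = deque(maxlen=3)
for event in ["a", "b", "c", "d", "e"]:
    recent.append(event)

print(list(recent))  # → ['c', 'd', 'e']
```

Older entries that fall out of the buffer would be the ones journaled to disk.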