Natural Language Processing – Persisting Parsed Data

database, natural-language-processing, parsing, persistence

I've recently started experimenting with natural language processing (NLP) using Stanford's CoreNLP, and I'm wondering: what are some of the standard ways to store parsed NLP data for something like a text-mining application?

One way I thought might be interesting is to store the parse tree's children as an adjacency list and make good use of recursive queries (Postgres supports these, and I've found they work really well), along the lines of the sketch below.
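
Concretely, I mean something like this. The table layout and node labels are just a toy example, and I'm using SQLite's in-memory database purely for illustration; the same `WITH RECURSIVE` pattern works in Postgres:

```python
import sqlite3

# Hypothetical adjacency-list schema: each parse-tree node stores its parent.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parse_node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parse_node(id),
        label     TEXT,     -- constituent tag (S, NP, VP, ...)
        token     TEXT      -- surface token for leaf nodes, NULL otherwise
    );
""")

# Toy tree for "dogs bark": (S (NP (NN dogs)) (VP (VB bark)))
conn.executemany(
    "INSERT INTO parse_node (id, parent_id, label, token) VALUES (?, ?, ?, ?)",
    [
        (1, None, "S",  None),
        (2, 1,    "NP", None),
        (3, 2,    "NN", "dogs"),
        (4, 1,    "VP", None),
        (5, 4,    "VB", "bark"),
    ],
)

# Recursive query: walk the whole subtree rooted at node 1.
rows = conn.execute("""
    WITH RECURSIVE subtree(id, label, token, depth) AS (
        SELECT id, label, token, 0 FROM parse_node WHERE id = 1
        UNION ALL
        SELECT n.id, n.label, n.token, s.depth + 1
        FROM parse_node n JOIN subtree s ON n.parent_id = s.id
    )
    SELECT label, token, depth FROM subtree
""").fetchall()

for label, token, depth in rows:
    print("  " * depth + f"{label} {token or ''}".strip())
```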

But I assume there are many standard ways of doing this, adopted over the years by people working in the field, depending on the kind of analysis being done. So what are the standard persistence strategies for parsed NLP data, and how are they used?

Best Answer

I once worked with an NLP toolkit and ran into the problem you described. I think there are (at least) two approaches:

  • (implicit approach): use memoization

    In programming languages where functions are first-class objects (such as Lua, Python, or Perl), automatic memoization can be implemented by replacing a function (at run time) with a wrapper that returns the cached value once it has been computed for a given set of parameters; a minimal sketch follows this list.

    This was the approach I used, and it could be implemented quickly; the drawback was that certain larger data structures had to be persisted to disk, and while loading them was orders of magnitude faster than recalculation, it still took its time.

  • (explicit approach): use a database, relational or document-oriented, to store all the results you might care about in the future. This requires more attention at the beginning, but pays off in the longer run.
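
For illustration, here is a minimal sketch of the implicit approach: a decorator that persists results to disk with Python's `pickle` module. The cache directory and the `parse_document` function are just placeholders, not what I actually used:

```python
import hashlib
import os
import pickle
from functools import wraps

def disk_memoize(cache_dir="parse_cache"):
    """Cache a function's return value on disk, keyed by its arguments."""
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            key = hashlib.sha1(
                pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            ).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path):          # cache hit: load instead of recompute
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)    # cache miss: compute and persist
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

@disk_memoize()
def parse_document(text):
    # Placeholder for an expensive NLP parse (e.g. a CoreNLP call).
    return {"tokens": text.split()}

print(parse_document("dogs bark"))   # computed and written to disk
print(parse_document("dogs bark"))   # loaded from the on-disk cache
```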

Edit: Another thing I've been using lately for multi-step, long-running computations is a workflow framework, of which there are dozens. It is not really about persistence, but persistence is a step in the workflow. I'm trying luigi for that, and it comes with, e.g., Hadoop and Postgres helper classes, which can eliminate a lot of boilerplate code.
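
For illustration, a minimal luigi task might look like the following; the file path and the parse step are placeholders, and a real pipeline would likely swap the local JSON target for one of luigi's Hadoop or Postgres helpers:

```python
import json
import luigi

class ParseDocument(luigi.Task):
    """One workflow step: parse a raw text file and persist the result."""
    input_path = luigi.Parameter()   # hypothetical path to a raw text file

    def output(self):
        # The existence of this target tells luigi the step is already done.
        return luigi.LocalTarget(self.input_path + ".parsed.json")

    def run(self):
        with open(self.input_path) as f:
            text = f.read()
        parsed = {"tokens": text.split()}   # placeholder for a real NLP parse
        with self.output().open("w") as out:
            json.dump(parsed, out)

if __name__ == "__main__":
    # Run with the local scheduler; luigi skips tasks whose output already
    # exists, which gives the persistence/memoization effect for free.
    luigi.build([ParseDocument(input_path="example.txt")], local_scheduler=True)
```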
