Text search – big data problem

algorithms hadoop lucene

I have a problem I was hoping I could get some advice on!

I have a LOT of text as input (about 20GB worth – not MASSIVE, but big enough). It's just free, unstructured text.

I also have a 'category list'. I want to process the text, cross-reference it against the items in the category list, and output the matching categories for each hit, e.g.

Input text

The quick brown fox ran over the lazy dog.

Category lookup

Colour: Red | Brown | Green

Speed: Slow | Quick | Lazy | Fast

Expected Output

Colour – Brown

Speed – Quick, Lazy

To add to the complexity of the problem, the source text probably won't match the category terms exactly, so some sort of fuzzy-matching algorithm will have to be applied here.

I want to use 'Big data' tech to solve this (whether or not it TRULY NEEDS big data isn't the question – it's a secondary objective).

My thinking is to use Hadoop MapReduce, with Lucene handling the fuzzy matching.
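To make the matching part concrete, this is roughly what I have in mind for the Lucene side (just a sketch against a Lucene 5/6-style API, to be called per mapper – the field names, class name, and sample data below are all made up for illustration):

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.RAMDirectory;

    public class CategoryMatcher {
        private final IndexSearcher searcher;

        // pairs = {term, category}; the lookup list is small, so an in-memory index is fine
        public CategoryMatcher(String[][] pairs) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
                for (String[] pair : pairs) {
                    Document doc = new Document();
                    doc.add(new StringField("term", pair[0].toLowerCase(), Field.Store.YES));
                    doc.add(new StringField("category", pair[1], Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
            searcher = new IndexSearcher(DirectoryReader.open(dir));
        }

        // Fuzzy-match one token from the input text; edit distance 2 is Lucene's maximum
        public void match(String token) throws Exception {
            FuzzyQuery q = new FuzzyQuery(new Term("term", token.toLowerCase()), 2);
            for (ScoreDoc hit : searcher.search(q, 5).scoreDocs) {
                Document d = searcher.doc(hit.doc);
                System.out.println(d.get("category") + " – " + d.get("term"));
            }
        }

        public static void main(String[] args) throws Exception {
            CategoryMatcher m = new CategoryMatcher(new String[][] {
                {"Brown", "Colour"}, {"Quick", "Speed"}, {"Lazy", "Speed"}
            });
            for (String token : "The quick brown fox ran over the lazy dog".split("\\s+")) {
                m.match(token);
            }
        }
    }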

What do you think? Am I way off base?

Thanks a lot – ANY advice appreciated!!

Duncan

Best Answer

I would recommend starting with Solr, then doing your machine learning with Mahout and Hadoop. Solr will give you basic text analysis through word stemming, normalization (lower-casing), and tokenization. If you enable term vectors in the schema, you can feed those directly into Mahout and experiment with the different algorithms there. A lot (maybe most) of Mahout's algorithms will work in a distributed manner on Hadoop, as well as in a pseudo-distributed manner locally while you're working.
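Enabling term vectors is just a matter of flags on the field definition in schema.xml – something like this (the field name "content" is only an example; "text_general" is one of Solr's stock field types):

    <field name="content" type="text_general" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>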

Once you've got Mahout picking out the right features of your text, you can add them to the docs already in Solr and do facet queries over them.
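With the labels stored in a field, the facet query side is straightforward with SolrJ – here's a sketch against the newer SolrJ client API (the core name "docs" and the "category" field are assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CategoryFacets {
        public static void main(String[] args) throws Exception {
            // Assumes a Solr core named "docs" with a "category" field holding the labels
            HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();

            SolrQuery query = new SolrQuery("*:*");
            query.setRows(0);                // we only want the facet counts, not the docs
            query.setFacet(true);
            query.addFacetField("category");

            QueryResponse rsp = solr.query(query);
            for (FacetField.Count c : rsp.getFacetField("category").getValues()) {
                System.out.println(c.getName() + ": " + c.getCount());
            }
            solr.close();
        }
    }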
