Text search – big data problem

algorithms hadoop lucene

I have a problem I was hoping I could get some advice on!

I have a LOT of text as input (about 20GB worth – not MASSIVE, but big enough). It's just free, unstructured text.

I also have a 'category list'. I want to process the text, cross-reference it against the items in the category list, and output the matching categories for each hit, e.g.

Input text

The quick brown fox ran over the lazy dog.

Category lookup

Colour: Red | Brown | Green

Speed: Slow | Quick | Lazy | Fast

Expected Output

Colour – Brown

Speed – Quick, Lazy

To add to the complexity of the problem, the source text probably won't match the category terms exactly, so some sort of fuzzy-matching algorithm will have to be applied here.

I want to use 'Big data' tech to solve this (whether or not it TRULY NEEDS big data isn't the question – it's a secondary objective).

My thinking is to use Hadoop MapReduce, with Lucene handling the fuzzy matching.
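To make the matching part concrete, this is roughly what I have in mind for the Lucene side (just a sketch against a Lucene 5/6-style API, to be called per mapper – the field names, class name, and sample data below are all made up for illustration):

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.RAMDirectory;

    public class CategoryMatcher {
        private final IndexSearcher searcher;

        // pairs = {term, category}; the lookup list is small, so an in-memory index is fine
        public CategoryMatcher(String[][] pairs) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
                for (String[] pair : pairs) {
                    Document doc = new Document();
                    doc.add(new StringField("term", pair[0].toLowerCase(), Field.Store.YES));
                    doc.add(new StringField("category", pair[1], Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
            searcher = new IndexSearcher(DirectoryReader.open(dir));
        }

        // Fuzzy-match one token from the input text; edit distance 2 is Lucene's maximum
        public void match(String token) throws Exception {
            FuzzyQuery q = new FuzzyQuery(new Term("term", token.toLowerCase()), 2);
            for (ScoreDoc hit : searcher.search(q, 5).scoreDocs) {
                Document d = searcher.doc(hit.doc);
                System.out.println(d.get("category") + " – " + d.get("term"));
            }
        }

        public static void main(String[] args) throws Exception {
            CategoryMatcher m = new CategoryMatcher(new String[][] {
                {"Brown", "Colour"}, {"Quick", "Speed"}, {"Lazy", "Speed"}
            });
            for (String token : "The quick brown fox ran over the lazy dog".split("\\s+")) {
                m.match(token);
            }
        }
    }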

What do you think? Am I way off base?

Thanks a lot – ANY advice appreciated!!

Duncan

Best Answer

I would recommend starting with Solr, then doing your machine learning with Mahout and Hadoop. Solr will give you basic text analysis through word stemming, normalization (lower-casing), and tokenization. If you enable term vectors in the schema, you can feed those directly into Mahout and experiment with the different algorithms there. A lot (maybe most) of Mahout's algorithms will work in a distributed manner on Hadoop, as well as in a pseudo-distributed manner locally while you're working.
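Enabling term vectors is just a matter of flags on the field definition in schema.xml – something like this (the field name "content" is only an example; "text_general" is one of Solr's stock field types):

    <field name="content" type="text_general" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>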

Once you've got Mahout picking out the right features of your text, you can add them to the docs already in Solr and do facet queries over them.
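With the labels stored in a field, the facet query side is straightforward with SolrJ – here's a sketch against the newer SolrJ client API (the core name "docs" and the "category" field are assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CategoryFacets {
        public static void main(String[] args) throws Exception {
            // Assumes a Solr core named "docs" with a "category" field holding the labels
            HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();

            SolrQuery query = new SolrQuery("*:*");
            query.setRows(0);                // we only want the facet counts, not the docs
            query.setFacet(true);
            query.addFacetField("category");

            QueryResponse rsp = solr.query(query);
            for (FacetField.Count c : rsp.getFacetField("category").getValues()) {
                System.out.println(c.getName() + ": " + c.getCount());
            }
            solr.close();
        }
    }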
