Text Tokenization Techniques for Search Engines

Tags: lucene, python, search, search-engine

We are looking for a way to tokenize some text in the same or a similar way that a search engine would.

The reason we are doing this is so that we can run some statistical analysis on the tokens. The language we are using is Python, so we would prefer a technique that works in that language, but we could probably set something up to use another language if necessary.

Example

Original text:

We have some great burritos!

Simplified (remove plurals and punctuation):

We have some great burrito

Further simplified (remove superfluous words):

great burrito

Best (recognize positive and negative meaning):

burrito -positive-

Best Answer

Python has a great natural language toolkit, NLTK. It supports word tokenization out of the box (you may first need a one-time nltk.download('punkt') to fetch the tokenizer models):

>>> import nltk
>>> text = 'We have some great burritos!'
>>> tokens = nltk.word_tokenize(text)
>>> tokens
['We', 'have', 'some', 'great', 'burritos', '!']
>>> nltk.pos_tag(tokens)   # may need nltk.download('averaged_perceptron_tagger')
[('We', 'PRP'), ('have', 'VBP'), ('some', 'DT'), ('great', 'JJ'), ('burritos', 'NNS'), ('!', '.')]

The last structure pairs each token with a part-of-speech tag, which lets you drop words from consideration based on their classification. You probably want to focus on the JJ (adjective) and NN-prefixed (noun) tags.
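For example, a quick sketch that keeps only the adjectives and nouns from the tagged output above:

>>> [word for word, tag in nltk.pos_tag(tokens)
...  if tag == 'JJ' or tag.startswith('NN')]
['great', 'burritos']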

From there you can apply stemming and detect positive and negative adjectives.

I believe that for adjective classification, however, you'd need to create your own corpus from online resources such as this; the library does give you the tools to do so.
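If building a corpus is more than you need, recent NLTK releases also bundle the VADER sentiment lexicon, which scores whole snippets without any training. A minimal sketch (the lexicon is a separate one-time download):

>>> import nltk
>>> nltk.download('vader_lexicon')   # one-time download
>>> from nltk.sentiment.vader import SentimentIntensityAnalyzer
>>> SentimentIntensityAnalyzer().polarity_scores('We have some great burritos!')['compound'] > 0
True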

Here is a stemming example using the Porter stemming algorithm:

>>> from nltk.stem.porter import PorterStemmer
>>> PorterStemmer().stem('burritos')
'burrito'
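
Putting the pieces together, here is a rough sketch of the full pipeline the question describes: tokenize, tag, stem the nouns, and label them by the sentiment of the surrounding adjectives. The positive set below is a hypothetical toy stand-in for whatever adjective corpus you end up building:

>>> import nltk
>>> from nltk.stem.porter import PorterStemmer
>>> positive = {'great', 'tasty'}   # hypothetical toy lexicon, not a real corpus
>>> stem = PorterStemmer().stem
>>> tagged = nltk.pos_tag(nltk.word_tokenize('We have some great burritos!'))
>>> label = '-positive-' if any(w.lower() in positive for w, t in tagged if t == 'JJ') else '-negative-'
>>> ['%s %s' % (stem(w), label) for w, t in tagged if t.startswith('NN')]
['burrito -positive-']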

O'Reilly published a book on the library, now available online.
