We are looking for a way to tokenize some text in the same or a similar way as a search engine would do it.
The reason we are doing this is so that we can run some statistical analysis on the tokens. The language we are using is Python, so we would prefer a technique that works in that language, but we could probably set something up to use another language if necessary.
Example
Original text:
We have some great burritos!
More simplified: (remove plurals and punctuation)
We have some great burrito
Even more simplified: (remove superfluous words)
great burrito
Best: (recognize positive and negative meaning)
burrito -positive-
Best Answer
Python has a great natural language toolkit, the NLTK. It supports word tokenisation out of the box:
The last structure includes natural language tags, which allow you to drop words from consideration based on their classification. You probably want to focus on the JJ (adjective) and NN-prefixed (noun) tags. From there on out you can apply stemming, and detect positive and negative adjectives.
I believe that for adjective classification, however, you'd need to create your own corpus from online resources such as this; the library does give you the tools for this.
Here is a stemming example using the Porter stemming algorithm:
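The stemming snippet referenced here is also missing from this copy; a minimal sketch using NLTK's PorterStemmer (the sample words are assumptions):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Porter stemming strips common suffixes, including simple plurals,
# so "burritos" reduces to "burrito".
for word in ("burritos", "burrito"):
    print(stemmer.stem(word))
```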
O'Reilly published a book on the library, now available online.