Natural Language Processing – Purpose of Chunking Strings

algorithms, natural-language-processing, strings, terminology

I see many examples of libraries that use a 'chunker' and also people asking how to write chunkers, but what is this for and why do we need it? Isn't it enough to split a text by whitespace or other characters that delimit boundaries between words?

Best Answer

Isn't it enough to split a text by whitespace or other characters that delimit boundaries between words?

In many cases, natural language processing is used to pick out pieces of sentences without necessarily analyzing the entire sentence. However, those pieces may consist of several words, so simply splitting the sentence into words on whitespace isn't very helpful. Imagine that you're building a database of historical facts, and a tiny portion of the input text looks like this:

Tony Orlando was born on April 3, 1944 in New York City and later moved to New Jersey.

In that case it's probably useful to know that this fact involves a person, a date, and two places, all of which consist of multiple words that aren't very useful by themselves. A chunker can break that sentence into phrases that are more useful than individual words, such as Tony Orlando, New York City, and even born on April 3, 1944. Identifying meaningful terms like these can speed up searching and yield better results.
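For a concrete picture, here is a minimal sketch using NLTK's RegexpParser, one common rule-based chunker (the question doesn't name a particular library, and the noun-phrase grammar below is only illustrative). It tokenizes the sentence, attaches part-of-speech tags, and then groups runs of tags into phrases:

```python
# A minimal rule-based chunking sketch using NLTK's RegexpParser.
# Assumes NLTK is installed along with the 'punkt' tokenizer and
# 'averaged_perceptron_tagger' data (via nltk.download(...)).
import nltk

sentence = ("Tony Orlando was born on April 3, 1944 in New York City "
            "and later moved to New Jersey.")

tokens = nltk.word_tokenize(sentence)   # split into words
tagged = nltk.pos_tag(tokens)           # attach part-of-speech tags

# An illustrative noun-phrase rule: optional determiner, any adjectives,
# then one or more nouns (proper or common).
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Print each noun phrase the chunker found, e.g. "Tony Orlando",
# "New York City", "New Jersey".
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```

Even this simple grammar recovers the multi-word names; a named-entity recognizer would go further and also label born on April 3, 1944 as a date.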
