NLP Algorithms – Splitting and Combining Words into Common Forms

Are there any existing algorithms which can look through a list of words and split or combine words into their more common form?

For example, I have a list of many business names in the health care industry. The word "healthcare" is often written "health care". There are also business names which might be split or combined, such as "Walmart" and "Wal mart".

Are there any algorithms which can look at my list of words and identify that "healthcare" is more often written as two words, and that "Wal mart" is more often written as a single word?

I'm looking for the names of existing algorithms (which can help when searching the web), or links to existing white-papers or blog posts.

I'd prefer an algorithm that doesn't depend on a dictionary or other external list of words or business names.

EDITS:

Background:

I already have some code that is moderately successful at this task. The code was thrown together without much rigor. I hoped there were some established algorithms, which would likely be more academic and complete than what I've come up with. This question is not about the method I've come up with, but saying "it's impossible" doesn't convince me.

Clarification:

The "more common form" of a word is the way the word(s) are most often written. For example, "Walmart" appeared many times, and "Wal mart" appeared many times, but "Walmart" appeared more often than "Wal mart", so "Walmart" is the "more common form" of the word.
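
To make that definition concrete, here is a minimal sketch of the frequency comparison it implies. It is not the code mentioned above, just an illustration: the function names and the sample business names are made up, and it assumes simple lowercasing and whitespace tokenization.

```python
from collections import Counter

def count_forms(names):
    """Count single tokens and adjacent token pairs across all names."""
    unigrams = Counter()
    bigrams = Counter()
    for name in names:
        tokens = name.lower().split()  # assumption: whitespace tokenization is good enough
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def more_common_forms(names):
    """Map each joined spelling to whichever variant (joined or split) occurs more often."""
    unigrams, bigrams = count_forms(names)
    preferred = {}
    for (first, second), split_count in bigrams.items():
        joined = first + second
        joined_count = unigrams.get(joined, 0)
        if joined_count == 0:
            continue  # only the split spelling was seen, so there is nothing to decide
        # Ties go to the joined spelling; that choice is arbitrary.
        preferred[joined] = joined if joined_count >= split_count else f"{first} {second}"
    return preferred

# Hypothetical sample data, for illustration only.
names = [
    "Walmart Pharmacy", "Walmart Vision Center", "Wal mart Pharmacy",
    "Acme Health Care", "Sunrise Health Care", "Sunrise Healthcare",
]
print(more_common_forms(names))
```

With this sample list, "walmart" wins over "wal mart" (2 occurrences to 1) and "health care" wins over "healthcare" (2 to 1). The sketch only considers two-word splits and exact matches, so it says nothing about misspellings or three-way splits.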

I don't expect this algorithm to produce perfect results. Like any machine learning problem, I expect the results to be dependent on the quality of data I give it, and how much data I have.

Best Answer

Generally, I think you are after linguistic normalization, and the algorithms applicable to the problem as you describe it are those that deal with polysemy and, in particular, collocations.

The word "healthcare" is often written "health care" ...

The accepted definition of a collocation is a combination of adjacent words that have a common meaning. The hypernym of "health care" and "healthcare" is social insurance for the ill and injured; coincidentally, this is also a related hypernym for "medicare" (they are not exactly the same, but I presume you are interested in business names that could mention any of the above).

The WordNet lexical database is one of the largest and you can use its search facilities to explore collocations and hypernyms.

The hypernyms, collocations and the semantic relationships are typically aggregated in a database, and I am unconvinced that,

... an algorithm that doesn't depend on a dictionary or other external list of words or business names.

is a viable approach. In the best case you would essentially be eschewing the shoulders of giants and slowly rebuilding what is already available in existing lexical databases and collocation dictionaries, as your algorithm accumulates and stores the interpretation of the collocations you encounter in your tasks.

Here are some additional resources and links:

Oxford Collocation dictionary (1, 2)
BBI Dictionary of English Word Combinations
EuroWordNet
Longman Collocations Dictionary and Thesaurus

For locating the necessary research papers and algorithms, I suggest that you simply search CiteSeer with "collocations" as the main term; it is fairly specific to natural language processing. As I said above, though, I am not sure you will be able to find an online algorithm that doesn't rely on dictionaries or pre-existing learning corpora for your task.
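
To give a feel for what the collocation literature measures, here is a rough sketch of pointwise mutual information (PMI), one standard association score for adjacent word pairs. It is only an illustration under simple assumptions: the function name, the `min_count` threshold, and the sample names are hypothetical, and the probabilities are plain relative frequencies from your own list, which is itself a small learning corpus.

```python
import math
from collections import Counter

def pmi_scores(names, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    unigrams = Counter()
    bigrams = Counter()
    for name in names:
        tokens = name.lower().split()  # assumption: whitespace tokenization
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        # PMI = log2( P(first, second) / (P(first) * P(second)) )
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores

# Hypothetical sample data, for illustration only.
sample = ["Acme Health Care", "Sunrise Health Care", "Health Care Partners", "Acme Supply"]
for pair, score in pmi_scores(sample).items():
    print(pair, round(score, 2))
```

A high positive score means the two words appear together more often than their individual frequencies would predict, which makes the pair a candidate collocation worth comparing against its joined spelling.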
