C# – Word Recognition in a String Without Spaces or Punctuation Marks

Tags: c#, data-mining, regular-expressions, statistics, strings

I have a small C# project that reads a file and gives me an output: a string that contains no spaces or punctuation marks. It may also contain a few misspellings.

Ex.
Output:
THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG

I wonder if there is a way to analyze this string using text mining/data mining and/or regular expressions to identify the words (preferably nouns, verbs and so forth) in the string?

I want to read a bunch of files, each giving me a different output, and rank them statistically from the one with the most recognized words down to the one that only contains a string of mumbo jumbo.

Also, if the string contains misspellings like:
THEQUICGBROWNFOSJUMPSOVERTHHLAZYDOG
I know that one can calculate the "distance" between a misspelled word and its closest match (using a corpus and probabilities), but this might prove more of a challenge since the string does not have any spaces or punctuation marks.
Any ideas how I can solve this?

Best Answer

Here is the general approach:

  1. Read a dictionary file and organize all words in a trie data structure. Many Unix systems have such files in the /usr/share/dict/ directory.

  2. Find possible matches of a prefix of your input in the trie. This will usually produce multiple matches; for example, theologyisabout begins with both theology and the.

  3. If we remove the matched prefixes, we get a set of possible continuations, on which we repeat step 2.

We then end up with a vast tree of possible interpretations.
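
Here is a minimal C# sketch of steps 1–3, assuming a word list at /usr/share/dict/words with one word per line; the Trie and Segment names are illustrative, not a standard API, and this version does exact matching only, so it enumerates the full (potentially exponential) interpretation tree:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWord; // the path from the root to this node spells a complete word
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    // Step 1: insert every dictionary word into the trie.
    public void Insert(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.IsWord = true;
    }

    // Step 2: yield the end index (exclusive) of every dictionary word
    // that is a prefix of input starting at position start.
    public IEnumerable<int> MatchPrefixes(string input, int start)
    {
        var node = _root;
        for (int i = start; i < input.Length; i++)
        {
            if (!node.Children.TryGetValue(input[i], out node))
                yield break;
            if (node.IsWord)
                yield return i + 1;
        }
    }
}

class Program
{
    // Step 3: recursively segment the remainder after each matched prefix.
    static IEnumerable<List<string>> Segment(Trie trie, string input, int start)
    {
        if (start == input.Length)
        {
            yield return new List<string>();
            yield break;
        }
        foreach (int end in trie.MatchPrefixes(input, start))
            foreach (var rest in Segment(trie, input, end))
            {
                rest.Insert(0, input.Substring(start, end - start));
                yield return rest;
            }
    }

    static void Main()
    {
        var trie = new Trie();
        // The path is an assumption; use any word list with one word per line.
        foreach (var line in File.ReadLines("/usr/share/dict/words"))
            trie.Insert(line.Trim().ToUpperInvariant());

        foreach (var words in Segment(trie, "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG", 0))
            Console.WriteLine(string.Join(" ", words));
    }
}
```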

There are two problems with this:

  • there will be an exponential number of interpretations
  • we might miss interpretations because of an unknown word, or some unknown grammatical form

We can solve both of these problems with fuzzy matching. When we look up prefixes in the trie, we allow letters to be missing, inserted, or changed. However, each such aberration increases the Levenshtein distance. If the summed Levenshtein distance of an interpretation grows too high, we can prune that interpretation and concentrate on other branches. You could also keep the branches in a priority queue and always investigate the branch with the lowest current edit distance, which is the most likely to be a sensible interpretation – not unlike Dijkstra's pathfinding algorithm.
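
A sketch of the priority-queue idea, assuming .NET 6's built-in PriorityQueue<TElement, TPriority>; FuzzyPrefixes is a placeholder for the fuzzy trie lookup (not shown here) that yields candidate words together with their edit distances:

```csharp
using System;
using System.Collections.Generic;

// One branch of the interpretation tree: the position reached in the input,
// the words recognized so far, and the summed Levenshtein distance.
record State(int Position, List<string> Words, int TotalDistance);

class BestFirstSegmenter
{
    // Placeholder for the fuzzy trie lookup: yields every (word, endPosition,
    // editDistance) candidate matching a prefix of input from the given position.
    public Func<string, int, IEnumerable<(string Word, int End, int Dist)>> FuzzyPrefixes;

    public State Segment(string input, int budget)
    {
        // Always expand the branch with the lowest summed edit distance first,
        // like Dijkstra's algorithm expands the closest unvisited node.
        var queue = new PriorityQueue<State, int>();
        queue.Enqueue(new State(0, new List<string>(), 0), 0);

        while (queue.TryDequeue(out var state, out _))
        {
            if (state.TotalDistance > budget)
                continue; // prune: this interpretation is too garbled
            if (state.Position == input.Length)
                return state; // cheapest complete interpretation comes out first

            foreach (var (word, end, dist) in FuzzyPrefixes(input, state.Position))
            {
                var words = new List<string>(state.Words) { word };
                queue.Enqueue(new State(end, words, state.TotalDistance + dist),
                              state.TotalDistance + dist);
            }
        }
        return null; // nothing within the budget: probably mumbo jumbo
    }
}
```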

Note that multiple prefix sequences with different edit distances might lead to the same remaining string. You can therefore keep your progress in a data structure that allows parts to be shared. This caching will likely be beneficial for performance. If you do implement a variant of Dijkstra's algorithm here, a known tail corresponds to a visited node in the graph.
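
A sketch of that sharing, under the assumption that we only care about the cheapest interpretation: a tail is fully identified by its start index in the input, so a dictionary from start index to the best distance seen so far is enough:

```csharp
using System.Collections.Generic;

class VisitedTails
{
    // Cheapest summed edit distance at which each tail (identified by the
    // start index of the remaining string) has already been reached.
    private readonly Dictionary<int, int> _best = new Dictionary<int, int>();

    // Returns false if this tail was already reached at an equal or lower
    // distance -- the analogue of skipping a visited node in Dijkstra.
    public bool ShouldExpand(int start, int totalDistance)
    {
        if (_best.TryGetValue(start, out int seen) && seen <= totalDistance)
            return false;
        _best[start] = totalDistance;
        return true;
    }
}
```

In the best-first loop sketched above, you would call ShouldExpand right after dequeuing a state and drop the state when the call returns false.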

The difficult part is how to actually perform the fuzzy matching. For example, you could decide on a maximum edit density of x edits per character (0 <= x <= 1), and abort an interpretation as soon as it is guaranteed to exceed that density. For a given string of length l we can therefore determine an edit budget b = x · l. This global budget is less useful while matching a single prefix in the trie, since the trie only helps if there are fewer edits than characters in the prefix. A per-prefix budget like b = floor(c / 2) for a prefix of length c might be sensible. How many edits you allow is not only a measure of how garbled a text your system will still “understand”, but also a performance setting – smaller budgets run faster, as fewer alternatives have to be investigated.
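
The budget arithmetic from this paragraph as a small helper, where x is the tunable density:

```csharp
using System;

static class EditBudget
{
    // Global budget for an input of length l: b = x * l, with 0 <= x <= 1.
    public static int ForInput(int inputLength, double x) =>
        (int)Math.Floor(x * inputLength);

    // Budget while matching a prefix of length c in the trie: b = floor(c / 2),
    // so that a match can never consist mostly of edits.
    public static int ForPrefix(int prefixLength) =>
        prefixLength / 2; // integer division floors for non-negative values
}
```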
