C# – Writing a spell checker similar to “did you mean”

algorithmscmemorysearch

I'm hoping to write a spellchecker for search queries in a web application – not unlike Google's "Did you mean?" The algorithm will be loosely based on this: http://catalog.ldc.upenn.edu/LDC2006T13

In short, it generates correction candidates and scores them on how often they appear (along with adjacent words in the search query) in an enormous dataset of known n-grams – Google Web 1T – which contains well over 1 billion 5-grams.

I'm not using the Web 1T dataset, but building my n-gram sets from my own documents – about 200k docs, and I'm estimating tens or hundreds of millions of n-grams will be generated.

This kind of process is pushing the limits of my understanding of basic computing performance – can I simply load my n-grams into memory in a hashtable or dictionary when the app starts? Is the only limiting factor the amount of memory on the machine?

Or am I barking up the wrong tree? Perhaps putting all my n-grams in a graph database with some sort of tree query optimisation? Could that ever be fast enough?

Best Answer

I'd first implement it in memory, because it is simplest and it may just work. However, I expect it won't. I'd then move to some offline system you are familiar with (database?) and see how slow it is. It may be fast enough that it stays simple. I suspect it will be too slow.

Now we come to the fun part. The first word I thought of when I read this is "cache". With any luck, you will have enough memory to build a big enough cache that it solves the speed problems with the database, because you won't access it that often.

If that doesn't work, then the next thing I would try would be to pack my information as tight as possible to try to get it more in memory. This will take more work, as you get to learn how your chosen language stores information. You may end up switching to a language in which you have more control over the amount of space it takes to store an n-gram, or you get to compress and uncompress them when you need them. This should make you a better programmer, but it will take longer.

As always, find your own path, do the easiest stuff first, and don't be afraid to experiment and fail. Good luck.