Database Search – Understanding Fuzzy Search Concepts

algorithmsconceptssearch

I thought about this, and have been trying to come up with solutions on how to fuzzy search a database, if for say example a user types a spelling mistake. Any glaring problems with the logic behind this? Will it work and has it been done before?

Our table we wish to search:

**tblArticles**
Body - Soundex_Body - CharacterCoded_Body

So we store the raw text body for physical display. The other 2 columns are used for searches which are precomputed the following way:

Soundex

Body is split into it's words, and translated to it's soundex version. IE, resulting body might be something like:

H252 B54 C23 E33... etc

So someone might enter 'dinosore', and the article body reads 'dinosaur' these both evaluate to B26. We then run a LIKE on the soundex value of search term.

Character Coded

Given a character mapping that maps chars to prime numbers, IE:

h = 2
e = 3
l = 5
o = 7
p = 11
c = 13

help = 2*3*5*11     =   330
hello = 2*3*5*5*7   =   1050
hell = 2*3*5*5      =   150
hlep = 2*5*3*11     =   330
cello = 13*3*5*5*7  =   6825

If a user meant to type 'hello' but they switched two or more characters around for example 'hlelo', they would evaluate to the same number. Split raw body into words, prime encode every word and store in database giving you a field that looks like:

330 6825 330 1050... etc

We can then like search on this value to match mistypes.

Benefits

  • Typos protected against
  • Phonetic incorrect spellings protected against
  • More non native english speaking friendly
  • Will work in any language (where soundex works)

Comments and thoughts? A sort of multilayered search. You can of course weight return values to make it even better (IE a literal text body match is worth more), but is this a good solution for spelling errors and non native english speakers doing searches?

Best Answer

There are a number of other search algorithms. Smith-Waterman is one of the better ones for human text, while BLAST is (so far) the best for searching DNA sequences. When you are presented text with various spelling errors such as hlep instead of help, then you are looking for the minimum edit distance.

For a library to implement a number of these functions in CLR in SQL Server 2005 (and later), look at the source forge project SimMetrics. Blog post about SimMetrics.
http://staffwww.dcs.shef.ac.uk/people/S.Chapman/simmetrics.html

Soundex was developed because the primary differences between regional speech variations was almost exclusively in the vowels - which is why it tosses vowels out. It isn't good at coping with transposed letters.

Related Topic