I thought about this, and have been trying to come up with solutions on how to fuzzy search a database, if for say example a user types a spelling mistake. Any glaring problems with the logic behind this? Will it work and has it been done before?
Our table we wish to search:
**tblArticles**
Body - Soundex_Body - CharacterCoded_Body
So we store the raw text body for physical display. The other 2 columns are used for searches which are precomputed the following way:
Soundex
Body is split into it's words, and translated to it's soundex version. IE, resulting body might be something like:
H252 B54 C23 E33... etc
So someone might enter 'dinosore', and the article body reads 'dinosaur' these both evaluate to B26. We then run a LIKE on the soundex value of search term.
Character Coded
Given a character mapping that maps chars to prime numbers, IE:
h = 2
e = 3
l = 5
o = 7
p = 11
c = 13
help = 2*3*5*11 = 330
hello = 2*3*5*5*7 = 1050
hell = 2*3*5*5 = 150
hlep = 2*5*3*11 = 330
cello = 13*3*5*5*7 = 6825
If a user meant to type 'hello' but they switched two or more characters around for example 'hlelo', they would evaluate to the same number. Split raw body into words, prime encode every word and store in database giving you a field that looks like:
330 6825 330 1050... etc
We can then like search on this value to match mistypes.
Benefits
- Typos protected against
- Phonetic incorrect spellings protected against
- More non native english speaking friendly
- Will work in any language (where soundex works)
Comments and thoughts? A sort of multilayered search. You can of course weight return values to make it even better (IE a literal text body match is worth more), but is this a good solution for spelling errors and non native english speakers doing searches?
Best Answer
There are a number of other search algorithms. Smith-Waterman is one of the better ones for human text, while BLAST is (so far) the best for searching DNA sequences. When you are presented text with various spelling errors such as
hlep
instead ofhelp
, then you are looking for the minimum edit distance.For a library to implement a number of these functions in CLR in SQL Server 2005 (and later), look at the source forge project SimMetrics. Blog post about SimMetrics.
http://staffwww.dcs.shef.ac.uk/people/S.Chapman/simmetrics.html
Soundex was developed because the primary differences between regional speech variations was almost exclusively in the vowels - which is why it tosses vowels out. It isn't good at coping with transposed letters.