Algorithms – Best Algorithm for String Similarity

algorithmsstring-matching

I am designing a plugin to uniquely identify content on various web pages, based on addresses.

So I may have one address which looks like:

1 someawesome street, anytown, F100 211

later I may find this address in a slightly different format.

1 someawesome street, F100 211,

or perhaps as vague as

someawesome street F100

These are technically the same address, but with a level of similarity. I would like to a) generate a unique identifier for each address to perform lookups, and b) figure out when a very similar address shows up.

What algorithms / techniques / String metrics should I be looking at? Levenshtein distance seems like an obvious choice, but curious if there's any other approaches that would lend themselves here.

Best Answer

Levenstein's algorithm is based on the number of insertions, deletions, and substitutions in strings.

Unfortunately it doesn't take into account a common misspelling which is the transposition of 2 chars (e.g. someawesome vs someaewsome). So I'd prefer the more robust Damerau-Levenstein algorithm.

I don't think it's a good idea to apply the distance on whole strings because the time increases abruptly with the length of the strings compared. But even worse, when address components, like ZIP are removed, completely different addresses may match better (measured using online Levenshtein calculator):

1 someawesome street, anytown, F100 211       (reference) 
1 someawesome st.,anytown                     (difference of 15, same address)     
1 otherplaces street,anytown,F100211          (difference of 13, different ddress) 
1 sameawesome street, othertown, CA98200      (difference of 13, different ddress)
anytown, 1 someawesome street                 (28 different same address)
anytown, F100 211, 1 someawesome street       (37 different same address)

These effects tend to worsen for shorter street name.

So you'd better use smarter algorithms. For example, Arthur Ratz published on CodeProject an algorithm for smart text comparison. The algorithm doesn't print out a distance (it can certainly be enriched accordingly), but it identifies some difficult things such as moving of text blocks (e.g. the swap between town and street between my first example and my last example).

If such an algorithm is too general for your case, you should then really work by components and compare only comparable components. This is not an easy thing if you want to parse any address format in the world. But if the target is more specific, say US, it is certainly feasible. For example, "street", "st.", "place", "plazza", and their usual misspellings could reveal the street part of the address, the leading part of which would in principle be the number. The ZIP code would help to locate the town, or alternatively it is probably the last element of the address, or if you don't like guessing, you could look for a list of city names (e.g. downloading a free zip code database). You could then apply Damerau-Levenshtein on the relevant components only.

Related Topic