Database – Matching algorithms in context of data quality problems

algorithmsdatadatabasestring-matching

I work for an organization that has lots of databases containing person information. The data quality is poor. One case was a surname I found like this (this is the worst-case scenario):

Mark "Dunno his surname, but it sounded like Lion 'RAR', Ha ha"

There is a date of birth of: 01/09/1499

This is a worst-case scenario. Most data quality issues are due to pressing a wrong key on the keyboard e.g. Snith instead of Smith (n is next me m on the keyboard).

I am looking for algorithms which can help me with some kind of "fuzzy matching" under these circumstances. Our requirements involve several millions of records per day. I have looked for "data matching" and discovered the following algorithms:

SQL SOUNDEX 
SQL METAPHONE
Levenshtein Distance

Also is there such thing as a possible match for dates of birth? A possible match for a surname were the Levenstein distance is 80%.

Therefore I have two questions:

  1. What algorithms are available except the three specified above?

  2. What approaches are used to match possible addresses

Best Answer

For date of birth and people in general, one will need to validate if that person exists and if thier address is valid.

You can do both of these using 3rd party services that will:

  • Validate if the person is a real person
  • Validate and standardize address information

You can do this yourself using algorithms that you have mentioned, but why re-invent the wheel and most likely your implementation will be flawed.

Most of these services are configurable and allow you to send information like name, DOB, phone number and will attempt to look up to see if that person exists.

The same holds true for address information. It will attempt standardize the address, locate it (see if it really exists) and then return the standardized address back to you.

In both of these scenarios, no hits can be returned, which means that the person is not found or the addresss is not found. In those cases you probably have a fake individual or address like the one in your sample data.

Also, sometimes it might return multiple matches or less than 100% matches. For example, if the only piece of valid information is a phone number, it might return back multiple people at that phone number.

For less than absolute matches you might set a confidence factor which you want to match on, like 90%. Anything below that number goes into a report which will require manual intervention. People who don't match should be eliminated from your datastore as it is junk data.

Related Topic