Algorithms – Deduplication and Similarity Detection of Complex Records

algorithms, data, database-design

I'm working on a project that involves records with a fairly large number of fields (~15-20), and I'm trying to figure out a good way to implement deduplication. Essentially, the records represent people along with some additional data. For example, the records are likely to include personal information like first name, last name, postal address, email address, etc., but not all records have the same amount of data.

Currently the records are stored in an RDBMS (MySQL), and I want to detect duplicates on insertion, still insert them, but flag them as duplicates. It needs to be fast, as I need to provide real-time feedback on whether a record is a duplicate. The dataset is large (millions of records).

I've considered the following options, but I'm not sure which is best or whether there are better options available:

  • Use MySQL's built-in fulltext search with fuzzy matching. The major issues are that it seems slow, only the latest version supports fulltext indexes on InnoDB (the alternative engine is MyISAM, which is not good and critically does not support transactions), and fuzzy searching alone does not seem like the best method for similarity detection.
  • Use simhash or similar. The issue here is that I'd also like to be able to detect synonyms, and I don't see how simhash handles them. For example, an address might be "Some Road" or "Some Rd." and a name might be "Mike" or "Michael" (see the sketch after this list).
  • Index the data using an Apache Lucene derivative (elasticsearch/solr/etc) and perform a query that would likely return numerous results.
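To make the simhash concern concrete, here is a minimal token-level simhash sketch (the 64-bit MD5-based token hash and the example strings are just illustrative assumptions, not any particular library's implementation). Because simhash only sees the literal tokens, "Some Road" and "Some Rd." produce different fingerprints unless abbreviations are expanded beforehand:

```python
# Minimal simhash sketch (illustrative only): hashes whitespace tokens into a
# 64-bit fingerprint.  "Some Road" and "Some Rd." share only the token "some",
# so their fingerprints differ unless "Rd." is expanded to "Road" first.
import hashlib

def simhash64(text: str) -> int:
    bit_counts = [0] * 64
    for token in text.lower().split():
        # 64-bit hash of the token, taken from the first 8 bytes of MD5
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(64):
            bit_counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, c in enumerate(bit_counts) if c > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

print(hamming(simhash64("Some Road"), simhash64("Some Rd.")))   # non-zero: fingerprints differ
print(hamming(simhash64("Some Road"), simhash64("Some Road")))  # 0: identical input
```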

In terms of using Apache Lucene, I've been reading about similarity detection and using cosine similarity to produce a value from 0 to 1 from the term frequency vectors that Lucene stores. I could apply this to the results of the Lucene query and check whether any of them are above a certain threshold. My concern is how relevant cosine similarity would be for the type of data I'm storing, i.e. a number of fields each containing a single word or a small number of words, compared to computing the cosine similarity of large text documents.
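To illustrate that concern, here is a rough sketch (not Lucene's actual scoring) of cosine similarity over term-frequency vectors for two short records. With only a handful of single-word fields, the vectors are nearly binary, so the score is driven almost entirely by exact token overlap:

```python
# Cosine similarity of term-frequency vectors (illustrative sketch).  With
# short, few-word fields the vectors are nearly binary, so the score mostly
# reflects exact token overlap rather than graded textual similarity.
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

rec1 = "Mike Smith 12 Some Road mike@example.com"
rec2 = "Michael Smith 12 Some Rd. mike@example.com"
print(round(cosine(rec1, rec2), 2))  # 0.67: only the exactly shared tokens count
```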

Basically, I'm wondering what the best way is to deduplicate this type of data (or, put differently, to detect similarities in this type of data)?

Best Answer

There is no silver bullet for de-duplication. You should concentrate first on normalization (from a pattern sense, not 3NF) and standardization. This gives you some kind of level playing field from which to start making comparisons.

To achieve this, you need to apply the standardization techniques that work for each type of data. Standardizing address data is a totally different problem domain from standardizing given names. Most of these data standardization problem domains are much too complex to attempt to solve yourself. Consider buying 3rd party software that does postal address validation and standardization, and another that does name standardization.
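As a rough illustration of what standardization means here, a toy dictionary-based pass might look like the following sketch (the lookup tables are tiny placeholders; real address and name standardization products rely on far larger reference data):

```python
# Toy standardization pass (illustrative only; real address/name standardization
# relies on much larger reference tables than these few entries).
STREET_SUFFIXES = {"rd": "road", "rd.": "road", "st": "street", "st.": "street",
                   "ave": "avenue", "ave.": "avenue"}
NICKNAMES = {"mike": "michael", "bill": "william", "bob": "robert"}

def standardize_address(address: str) -> str:
    tokens = address.lower().split()
    return " ".join(STREET_SUFFIXES.get(t, t) for t in tokens)

def standardize_given_name(name: str) -> str:
    return NICKNAMES.get(name.lower(), name.lower())

print(standardize_address("12 Some Rd."))   # -> "12 some road"
print(standardize_given_name("Mike"))       # -> "michael"
```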

For things like e-mail addresses or phone numbers, you can probably roll your own, since these are relatively straightforward by comparison.
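For example, a home-grown normalization for emails and phone numbers could be as simple as the following sketch (the +44 default country code is an arbitrary assumption for illustration):

```python
# Simple email/phone normalization (sketch; the "44" default country code is an
# arbitrary assumption for illustration).
import re

def normalize_email(email: str) -> str:
    return email.strip().lower()

def normalize_phone(phone: str, default_country: str = "44") -> str:
    digits = re.sub(r"\D", "", phone)          # keep digits only
    if digits.startswith("00"):
        digits = digits[2:]                    # 00-prefixed international form
    elif digits.startswith("0"):
        digits = default_country + digits[1:]  # national form -> add country code
    return "+" + digits

print(normalize_email("  John.Smith@Example.COM "))  # -> john.smith@example.com
print(normalize_phone("(020) 7946 0018"))            # -> +442079460018
```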

Once you've got your data components properly standardized, then you can worry about what's better: fuzzy matching, Levenshtein distance, cosine similarity, etc.
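As one example of such a comparison, a plain Levenshtein distance turned into a 0-to-1 similarity might look like this sketch (dedicated libraries are faster over millions of records):

```python
# Levenshtein distance via dynamic programming, plus a 0-1 similarity derived
# from it (sketch; optimized libraries are preferable at scale).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

print(levenshtein("michael", "micheal"))                 # 2
print(round(similarity("some road", "some rd."), 2))     # 0.67
```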

You're best off matching comparable sub-elements rather than trying to compare records as a whole. Then look at how many sub-elements reasonably match. Two identical names with different email addresses and mailing addresses are a very weak match. Two nearly identical names with nearly identical mailing addresses, where one record is missing the email address, is probably a pretty strong match.
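A sketch of that idea: compare standardized sub-elements field by field, weight them, and skip fields missing from either record. The weights, the threshold, and the exact-match placeholder comparison below are illustrative assumptions, not recommendations:

```python
# Field-by-field scoring sketch: compare standardized sub-elements, weight them,
# and skip fields missing from either record.  Weights and the 0.85 threshold
# are illustrative placeholders.
FIELD_WEIGHTS = {"first_name": 1.0, "last_name": 2.0, "address": 3.0, "email": 3.0}

def field_similarity(a: str, b: str) -> float:
    # Placeholder: exact match; swap in Levenshtein or cosine per field type.
    return 1.0 if a == b else 0.0

def match_score(rec_a: dict, rec_b: dict) -> float:
    score = total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue                      # missing data: no evidence either way
        total += weight
        score += weight * field_similarity(a, b)
    return score / total if total else 0.0

a = {"first_name": "michael", "last_name": "smith", "address": "12 some road"}
b = {"first_name": "michael", "last_name": "smith",
     "address": "12 some road", "email": "mike@example.com"}
print(match_score(a, b) >= 0.85)   # True: strong match despite the missing email
```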
