There is no silver bullet for de-duplication. Concentrate first on normalization (in the pattern sense, not the 3NF sense) and standardization. That gives you a level playing field from which to start making comparisons.
To achieve this, you need to apply the standardization techniques that work for each type of data. Standardizing address data is a totally different problem domain from standardizing given names. Most of these data-standardization problem domains are far too complex to attempt to solve yourself. Consider buying third-party software that does postal address validation and standardization, and another package that does name standardization.
For things like e-mail addresses or phone numbers, you can probably roll your own, since these are relatively straightforward by comparison.
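As a rough illustration of what roll-your-own normalization might look like, here is a minimal Python sketch; the specific rules (lowercasing e-mail addresses, reducing phone numbers to digits, assuming a North American country code when one is missing) are assumptions you would tune to your own data:

```
import re

def normalize_email(raw: str) -> str:
    """Lowercase and trim; treats the whole address as case-insensitive
    (an assumption that is safe in practice, even though local parts are
    technically case-sensitive)."""
    return raw.strip().lower()

def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Reduce a phone number to digits only and prepend an assumed default
    country code ('1' here, i.e. North America) when one seems to be missing."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:   # assume a national number without a country code
        digits = default_country + digits
    return digits

print(normalize_phone("(555) 123-4567"))            # -> 15551234567
print(normalize_email("  John.Doe@Example.COM "))   # -> john.doe@example.com
```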
Once you've got your data components properly standardized, then you can worry about which comparison works better: fuzzy matching, Levenshtein distance, cosine similarity, etc.
You're better off matching like sub-elements against each other rather than trying to compare records as a whole. Then look at how many sub-elements reasonably match. Two identical names with different e-mail addresses and mailing addresses are a very weak match. Two nearly identical names with nearly identical mailing addresses, where one record is missing the e-mail address, is probably a pretty strong match.
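A hedged sketch of that sub-element idea, using Python's difflib as a stand-in for whichever similarity measure you settle on; the field names, the 0.9 threshold, and the sample records are all made up for illustration:

```
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Similarity of two already-standardized field values; returns None when
    either side is missing, so absent data neither helps nor hurts the score."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a, b).ratio()

def record_match_strength(rec1, rec2, fields=("name", "address", "email", "phone")):
    """Count how many sub-elements agree reasonably well (>= 0.9 here, an
    arbitrary cutoff) versus how many could actually be compared."""
    scores = [field_similarity(rec1.get(f), rec2.get(f)) for f in fields]
    compared = [s for s in scores if s is not None]
    strong = sum(1 for s in compared if s >= 0.9)
    return strong, len(compared)

a = {"name": "jon smith", "address": "12 main st", "email": None}
b = {"name": "john smith", "address": "12 main street", "email": "j@x.com"}
print(record_match_strength(a, b))
# -> (1, 2): the names match strongly, the addresses fall just below 0.9,
#    and e-mail can't be compared because one side is missing it
```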
It seems to me that the only way you can really change the time complexity is to start to turn the Jaro-Winkler algorithm 'inside-out' by collecting statistics on each value that are relevant to the algorithm. I'll be honest and admit that I'd never heard of this algorithm prior to reading your post (thanks!). As such, I am going to start typing out some vague ideas and hope they formulate into a coherent approach. Hopefully these ideas will either work or give you some other ideas that do.
So looking at the wiki page it seems there are only three things you need to figure out:
- length of the strings: s
- transpositions: t
- matching characters: m
Getting the length of each string is beyond easy; it's always the same for each string: done. Transpositions and matches, on the other hand, are specific to each comparison. But we don't necessarily need exact values for these in order to decide whether two strings are a candidate pair. I think you can probably create tests that help narrow things down.
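To keep those three quantities concrete, here is a minimal sketch of the plain Jaro score (the Winkler prefix bonus is left out), following the Wikipedia formulation rather than any particular library:

```
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity: m matching characters, t transpositions,
    averaged over three ratios as on the wiki page."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2

    # m: characters of s1 that can be paired with an equal character of s2
    # within the allowed window
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # t: half the number of matched characters that appear in a different order
    k = transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3

print(round(jaro("martha", "marhta"), 3))   # 0.944, the classic textbook example
```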
The first thing that comes to mind is inspired by a bloom filter: simply index each string by the characters it contains. Literally take all the letters and put a reference to each string that contains them.
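A minimal sketch of that character index in Python; the word list is invented purely for illustration:

```
from collections import defaultdict

def build_char_index(strings):
    """Map each character to the set of strings that contain it
    (the bloom-filter-flavoured index described above)."""
    index = defaultdict(set)
    for s in strings:
        for ch in set(s):
            index[ch].add(s)
    return index

words = ["cattywampus", "tack", "thwack", "tachometer", "dog"]
index = build_char_index(words)

# Candidates containing all of 'c', 'a' and 't', i.e. the [C,A,T] bucket:
print(index["c"] & index["a"] & index["t"])
# {'cattywampus', 'tack', 'thwack', 'tachometer'}  (set order may vary)
```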
Then I can take a string like CAT and find all the other words that contain a 'C', 'A', and 'T'. I'll denote that as [C,A,T]. I can then look for [C,A], [C,T], and [A,T]. I'll presume for a moment that anything containing fewer than two of 'C', 'A', or 'T' would not meet the threshold, i.e. you know that m/s1 can be no more than 1/3. You can take this a bit further and start calculating an upper bound for the comparison. This need not be very precise. For example, with the strings in [C,A,T] you might have these upper bounds:
cattywampus: (3/3 + 3/11 + 1)/3 = 0.758
tack: (3/3 + 3/4 + 1)/3 = 0.917
thwack: (3/3 + 3/6 + 1)/3 = 0.833
tachometer: (3/3 + 3/10 + 1)/3 = 0.767
If your threshold is 0.8, you can eliminate two of these because you know the best they could do is under your minimum. For the two-letter indexes, you know that the first factor m/s1 can never be more than 2/3, and you can do the same analysis. Any string that has none of 'C', 'A', and 'T' has a result of 0, by definition.
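Here is a rough sketch of that pruning step. It only computes a bound, so anything it keeps still needs the full Jaro-Winkler comparison; the 0.8 threshold and the word list come from the example above:

```
from collections import Counter

def jaro_upper_bound(query: str, candidate: str) -> float:
    """Cheap upper bound on the Jaro score: assume zero transpositions and that
    every shared character (counted with multiplicity) actually matches."""
    shared = sum((Counter(query) & Counter(candidate)).values())
    if shared == 0:
        return 0.0
    return (shared / len(query) + shared / len(candidate) + 1) / 3

threshold = 0.8
for w in ["cattywampus", "tack", "thwack", "tachometer"]:
    bound = jaro_upper_bound("cat", w)
    verdict = "keep" if bound >= threshold else "prune"
    print(f"{w:12s} {bound:.3f} {verdict}")
# cattywampus  0.758 prune
# tack         0.917 keep
# thwack       0.833 keep
# tachometer   0.767 prune
```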
I think this is a start toward getting away from quadratic time. You can probably do better with some more interesting data structures or heuristics, and I'm sure someone can suggest some of those techniques. One not-yet-well-formulated idea is to extend the index to cover not only which characters appear but also their positions.
Best Answer
Consider this.
You have millions of rows already in the DB. Each row must have a single surrogate PK field that's (ideally) a number. Each row also has the key fields used for duplicate detection.
Compute a hash of the various comparison key fields. Load that hash and PK into an in-memory hashmap (or treemap). For a couple of million rows, that's only a few megabytes of memory; hardly anything on a reasonably modern processor. The slowest part is computing the hash of millions of rows.
Each incoming row gets checked against the hashmap. If it's in the map, the comparison keys indicate a duplicate. If it's not in the map, it's okay.
That should cover most bases. The odds of a "false positive" collision are really small. And for possible duplicates, you simply redo the check to confirm that the keys are indeed a duplicate.
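A small sketch of the hash-and-check idea in Python; the key-field names, the field separator, and the choice of SHA-1 are arbitrary placeholders, and the "rows" are literals standing in for the millions in the database:

```
import hashlib

def key_hash(row, key_fields=("surname", "given_name", "postcode")):
    """Hash the standardized comparison-key fields; the field names are
    placeholders for whatever you actually dedupe on."""
    joined = "\x1f".join(str(row[f]).strip().lower() for f in key_fields)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

# Existing rows, keyed by surrogate PK (stand-ins for the real table).
existing_rows = {
    101: {"surname": "Smith", "given_name": "Jon", "postcode": "90210"},
    102: {"surname": "Jones", "given_name": "Ann", "postcode": "10001"},
}
hash_to_pk = {key_hash(r): pk for pk, r in existing_rows.items()}

incoming = {"surname": "smith ", "given_name": "JON", "postcode": "90210"}
h = key_hash(incoming)
if h in hash_to_pk:
    # Possible duplicate: re-compare the real key fields to rule out a collision.
    print("possible duplicate of PK", hash_to_pk[h])   # -> PK 101
else:
    print("new row")
```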
If the comparison key fields are proper keys, they can't be updated. You can denormalize the entire PK-key hash map outside the database and avoid rebuilding it.
If the comparison key fields are attributes (and can be updated), you have to handle denormalization of the hash values gracefully. A change to a comparison key changes the hash value. Compute the new hash at the time of the update and save a memento of this change so that the comparison hash map can be tweaked without being rebuilt from scratch.
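A toy sketch of applying such a memento to the in-memory map, assuming you already have the old and new hashes in hand (computed with the same key-hash helper as in the earlier sketch):

```
def apply_key_update(hash_to_pk, pk, old_hash, new_hash):
    """Apply one 'memento' of a comparison-key change: drop the entry for the
    old key hash and register the new one, without rebuilding the whole map."""
    if old_hash == new_hash:
        return
    if hash_to_pk.get(old_hash) == pk:
        del hash_to_pk[old_hash]
    hash_to_pk[new_hash] = pk

# Toy usage: row 101's key fields changed, so its hash moves from "aaa" to "bbb".
m = {"aaa": 101, "ccc": 102}
apply_key_update(m, 101, "aaa", "bbb")
print(m)   # {'ccc': 102, 'bbb': 101}
```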