Database – How to Perform Data Deduplication at Scale

algorithms, database

I need to develop, or at least conceptualize, a module that does efficient data deduplication. Say we already have millions of data records. At the top level, the module needs to insert another 100 million records while making sure there are no duplicate records in the resulting dataset. That presumably means comparing on the field(s) that decide whether two records are duplicates. But doing that comparison serially is really naive when we're talking about millions of records.

What do you think a viable approach could be? Hashing? Using divide and conquer type algorithms to exploit parallelism? I have these ideas in my head, but it gets dizzying at this scale.

Also, please post any pointers to resources on the Web I can use. All I could find were debates and vendors touting their databases' "supreme data deduplication features".

Best Answer

Hashing?

Essential.

using divide and conquer type algorithms to exploit parallelism?

If necessary.

Consider this.

You have millions of rows already in the DB. Each row must have a single surrogate PK field that's (ideally) a number. Each row also has the key fields used for duplicate detection.

  1. Compute a hash of the comparison key fields for every existing row. Load that hash and PK into an in-memory hash map (or tree map). For a couple of million rows, that's only a few megabytes of memory: hardly anything on reasonably modern hardware. The slowest part is computing the hashes for millions of rows.

  2. Check each incoming row against the hash map. If its hash is in the map, the comparison keys indicate a likely duplicate. If it's not in the map, the row is definitely new. (See the sketch after this list.)
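
A minimal sketch of those two steps, assuming SHA-256 as the hash, a sqlite3-style connection whose `execute()` returns an iterable cursor, and a hypothetical table `records` with a surrogate PK `id` and comparison key columns `key_a` and `key_b`:

    import hashlib

    def key_hash(*key_fields):
        """Hash the comparison key fields into one fixed-size digest."""
        # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
        joined = "\x1f".join(str(f) for f in key_fields)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def build_hash_index(conn):
        """Step 1: load (key hash -> PK) for every existing row into memory."""
        index = {}
        for pk, key_a, key_b in conn.execute("SELECT id, key_a, key_b FROM records"):
            index[key_hash(key_a, key_b)] = pk
        return index

    def is_probable_duplicate(index, key_a, key_b):
        """Step 2: an incoming row whose key hash is already in the map is a likely duplicate."""
        return key_hash(key_a, key_b) in index

The table and column names are placeholders; the point is only that the map holds one small (hash, PK) entry per existing row.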

That should cover most bases. The odds of a "false positive" collision are really small. And for the rows flagged as possible duplicates, you simply redo the check against the stored row to confirm that the keys really are duplicates.
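
Continuing the sketch above (same assumed schema), the confirmation step on a hash hit might look like this; it fetches the existing row by PK and compares the actual key fields to rule out a hash collision:

    def confirm_duplicate(conn, index, key_a, key_b):
        """On a hash hit, re-check the real key fields against the stored row."""
        pk = index.get(key_hash(key_a, key_b))
        if pk is None:
            return False  # no hash hit: definitely a new row
        row = conn.execute(
            "SELECT key_a, key_b FROM records WHERE id = ?", (pk,)
        ).fetchone()
        return row == (key_a, key_b)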

If the comparison key fields are proper keys, they can't be updated. You can then denormalize the entire PK-to-hash map outside the database and avoid rebuilding it.
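
One way to keep that map outside the database is to serialize it to a file and only fall back to a full rebuild when the file is missing. A sketch, with the file path being an arbitrary choice and `build_hash_index` coming from the earlier snippet:

    import pickle

    INDEX_PATH = "dedup_index.pkl"  # hypothetical location for the persisted map

    def save_index(index):
        """Persist the PK/hash map so it survives restarts without a rebuild."""
        with open(INDEX_PATH, "wb") as f:
            pickle.dump(index, f)

    def load_index(conn):
        """Reload the persisted map, rebuilding it from the DB only if it is missing."""
        try:
            with open(INDEX_PATH, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return build_hash_index(conn)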

If the comparison key fields are attributes (and can be updated), you have to handle denormalization of the hash values gracefully. A change to a comparison key changes its hash. Compute the new hash at the time of the update and save a memento of the change so that the in-memory map can be tweaked without being rebuilt from scratch.
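
A small sketch of that incremental maintenance, reusing `key_hash` from above; here the "memento" is simply the old key values captured at update time:

    def on_key_update(index, pk, old_keys, new_keys):
        """Keep the in-memory map in sync when a row's comparison keys change."""
        old_hash = key_hash(*old_keys)
        if index.get(old_hash) == pk:
            del index[old_hash]          # drop the stale entry
        index[key_hash(*new_keys)] = pk  # register the new key hash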
