Database – Efficient algorithm for data deduplication in procedural code

algorithm-analysis, algorithms, database, efficiency, speed

I have written a data cleansing application which, for the most part, works well. It is not designed to handle large volumes of data: nothing more than about half a million rows. So early on in the design process, a decision was made to try to do as much of the work in memory as possible. The thought was that writing to a database or to disk would slow things down.

For most of the various cleaning operations the application offers, this has proved true. When it comes to deduplication, however, it is absurdly slow. Running on a fairly powerful server, it takes a full 24 hours to dedupe half a million rows of data.

In pseudocode, my algorithm runs along these lines:

List<FileRow> originalData;
List<FileRow> copiedData = originalData.ToList();

foreach(FileRow original in originalData)
{
    foreach(FileRow copy in copiedData)
    {
        //don't compare rows against themselves
        if(original.Id != copy.Id) 
        {
            // if it's a perfect match, don't waste time with slow fuzzy algorithm
            if(original.NameData == copy.NameData)
            {
                original.IsDupe = true;
                break;
            }

            // if it's not a perfect match, try Jaro-Winkler
            if(_fuzzyMatcher.DataMatch(original.NameData, copy.NameData))
            {
                original.IsDupe = true;
                break;
            }
        }
    }
}

Looking at this, it's obvious why it's so slow: where the other operations can make a single pass through the rows, this one has to go through the whole file again for every row. So the processing time increases quadratically.

I have also used threading elsewhere to speed things up, but my attempts to thread this have failed. In the real-world code we don't just mark duplicates as "true" but group them, so that all instances of a given match get a unique group Id. But the procedure has no way of knowing whether another thread has already found and marked a duplicate, so threading leads to errors in the group Id assignment.
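For illustration only, here is roughly what that sequential dependency looks like (the GroupId field and nextGroupId counter are hypothetical names, not the real code): each row's group assignment depends on what earlier iterations have already decided, which is exactly the shared state that naive threading races on.

int nextGroupId = 0;

foreach(FileRow original in originalData)
{
    // skip rows an earlier iteration has already placed in a group
    if(original.GroupId != null) continue;

    foreach(FileRow copy in copiedData)
    {
        if(original.Id == copy.Id) continue;

        if(original.NameData == copy.NameData
            || _fuzzyMatcher.DataMatch(original.NameData, copy.NameData))
        {
            // join the copy's existing group if it has one, otherwise start a new group;
            // this read-then-write on shared state is what breaks under naive threading
            int groupId = copy.GroupId ?? nextGroupId++;
            original.GroupId = groupId;
            copy.GroupId = groupId;
            break;
        }
    }
}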

To try to improve matters, we added a database-backed cache of common Jaro-Winkler matches, hoping to eliminate the need for that relatively slow method. It didn't make a significant difference.
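For reference, an in-memory version of that cache is roughly a sketch like this (the dictionary, the key normalisation, and the CachedDataMatch name are illustrative, not the actual implementation):

// memoise fuzzy-match results so each distinct pair of values is only scored once
Dictionary<(string, string), bool> matchCache = new Dictionary<(string, string), bool>();

bool CachedDataMatch(string a, string b)
{
    // normalise key order so (a, b) and (b, a) share one cache entry
    var key = string.CompareOrdinal(a, b) <= 0 ? (a, b) : (b, a);

    if(!matchCache.TryGetValue(key, out bool isMatch))
    {
        isMatch = _fuzzyMatcher.DataMatch(a, b);   // the relatively slow Jaro-Winkler call
        matchCache[key] = isMatch;
    }
    return isMatch;
}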

Is there another approach I can try, or improvements I can make to this algorithm to make it faster? Or am I better off giving up trying to do this in memory and writing it to a database to do the job there?

Best Answer

It seems to me that the only way you can really change the time complexity is to start to turn the Jaro-Winkler algorithm 'inside-out' by collecting statistics on each value that are relevant to the algorithm. I'll be honest and admit that I'd never heard of this algorithm prior to reading your post (thanks!). As such, I am going to start typing out some vague ideas and hope they formulate into a coherent approach. Hopefully these ideas will either work or give you some other ideas that do.

So looking at the wiki page it seems there are only three things you need to figure out:

  • lengths of the two strings: s1 and s2
  • transpositions: t
  • matching characters: m
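For concreteness, those three quantities combine in the Jaro similarity as (m/s1 + m/s2 + (m - t)/m)/3, with a similarity of 0 when m is 0. Here is a minimal helper, just so the upper-bound arithmetic below has something concrete to point back to:

// Jaro similarity from the three quantities above; Jaro-Winkler adds a common-prefix bonus on top
static double JaroSimilarity(int m, int t, int s1, int s2)
{
    if(m == 0) return 0.0;                    // no matching characters at all
    return ((double)m / s1 + (double)m / s2 + (double)(m - t) / m) / 3.0;
}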

Getting the length of each string is beyond easy; it's always the same for each string: done. Transpositions and matches, however, are specific to each comparison. But we don't necessarily have to know exact values for these in order to determine whether two strings are a candidate pair. I think you can probably create tests that will help narrow things down.

The first thing that comes to mind is inspired by a Bloom filter: simply index each string by the characters it contains. Literally, for each letter, keep a reference to every string that contains it.
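A minimal sketch of that index, assuming the FileRow shape from the question (Id and NameData) and an in-memory dictionary; the names are illustrative:

// requires System.Collections.Generic and System.Linq

// character -> every row whose NameData contains that character
Dictionary<char, List<FileRow>> charIndex = new Dictionary<char, List<FileRow>>();

foreach(FileRow row in originalData)
{
    foreach(char c in row.NameData.Distinct())   // index each distinct character once per row
    {
        if(!charIndex.TryGetValue(c, out List<FileRow> rows))
        {
            rows = new List<FileRow>();
            charIndex[c] = rows;
        }
        rows.Add(row);
    }
}

Building this is a single pass over the data, and lookups into it are what replace the inner loop over every row.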

Then I can take a string like CAT and find all the other words that contain a 'C', 'A', and 'T'. I'll denote that as [C,A,T]. I can then look for [C,A], [C,T], and [A,T]. I'll presume for a moment that anything with fewer than two of 'C', 'A', or 'T' would not meet the threshold, i.e. you know that m/s1 can be no more than 1/3. You can take this a bit further and start calculating an upper-bound value for the comparison. This need not be very precise. For example, with the strings in [C,A,T] you might have these upper bounds:

cattywampus: (3/3 + 3/11 + 1)/3 = 0.758
tack: (3/3 + 3/4 + 1)/3 = 0.917
thwack: (3/3 + 3/6 + 1)/3 = 0.833
tachometer: (3/3 + 3/10 + 1)/3 = 0.767

If your threshold is 0.8, you can eliminate two of these, because you know the best they could do is below your minimum. For the two-letter indexes, you know that the first factor m/s1 can never be more than 2/3, and you can do the same analysis. Any string that has none of 'C', 'A', or 'T' has a result of 0, by definition.
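Pulling that together, a rough sketch of the pruning pass might look like the following. It assumes the charIndex built above, a 0.8 threshold as in the example, and that the query's characters are distinct (as with CAT); the optimistic bound treats every shared character as a match with no transpositions, so the last term of the formula is 1.

const double threshold = 0.8;
string query = "CAT";

// count how many of the query's characters each row shares (an optimistic m)
Dictionary<FileRow, int> sharedCount = new Dictionary<FileRow, int>();
foreach(char c in query.Distinct())
{
    if(charIndex.TryGetValue(c, out List<FileRow> rows))
    {
        foreach(FileRow row in rows)
        {
            sharedCount[row] = sharedCount.TryGetValue(row, out int n) ? n + 1 : 1;
        }
    }
}

// keep only rows whose best possible score could still reach the threshold
List<FileRow> candidates = new List<FileRow>();
foreach(KeyValuePair<FileRow, int> pair in sharedCount)
{
    double m = pair.Value;
    double upperBound = (m / query.Length + m / pair.Key.NameData.Length + 1.0) / 3.0;
    if(upperBound >= threshold)
    {
        candidates.Add(pair.Key);   // only these go on to the real Jaro-Winkler comparison
    }
}

Only the surviving candidates get the full fuzzy comparison, so the quadratic work shrinks to however many rows actually share enough characters with the query.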

I think this is a start towards getting away from quadratic time. You can probably do better with some more interesting data structures or heuristics, and I'm sure someone can suggest some of those techniques. One not-fully-formed idea is to extend this to index not only the characters but also their positions; a rough reading of that is sketched below.
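Purely as a sketch of that last idea (keying on exact position is a guess, not a worked-out design): index by (character, position) so a lookup can be limited to positions near where the character sits in the query, roughly mirroring the matching window Jaro uses.

// (character, position) -> rows with that character at that position
Dictionary<(char, int), List<FileRow>> positionIndex = new Dictionary<(char, int), List<FileRow>>();

foreach(FileRow row in originalData)
{
    string name = row.NameData;
    for(int i = 0; i < name.Length; i++)
    {
        var key = (name[i], i);
        if(!positionIndex.TryGetValue(key, out List<FileRow> rows))
        {
            rows = new List<FileRow>();
            positionIndex[key] = rows;
        }
        rows.Add(row);
    }
}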
