How to best detect duplicate data in a large dataset

algorithms, data structures

I recently heard of the statistic "87% of the US population can be uniquely identified by a tuple of their zip code, birth date and gender". This is apparently not true, and I was wondering how I would verify it if I had the census data.
So, imagining I had a 300-million-line unsorted text file containing the gender, zip code and birth date of each person living in the US, what would be the quickest way to find out what percentage of the population is uniquely identifiable by that tuple?

This should be a matter of identifying what percentage of the entries are duplicated in the dataset, but what would be a good way to go about it? I'm interested in useful algorithms and efficient data structures, and speed is more important than memory consumption as long as the latter is kept to a reasonable level.

Best Answer

SQL solution

You could load all the demographic data into an SQL database:

CREATE TABLE PERSON(Id integer PRIMARY KEY, zip text, birth date, gender char /*... */);
...

Unfortunately, the file import statement is not standard SQL (e.g. BULK INSERT for SQL Server, LOAD DATA INFILE for MySQL, or SQL*Loader for Oracle).

The easiest and most efficient way would then be to use an aggregate function with a GROUP BY clause to count the number of persons sharing the same values for the grouping columns, keeping only the groups with duplicates by means of a HAVING clause:

SELECT zip, birth, gender, count(*) FROM PERSON 
   GROUP BY zip, birth, gender
   HAVING count(*)>1;
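
The query above lists the duplicated tuples, whereas the question asks for a percentage. A minimal sketch of one way to get that number, assuming SQLite (through Python's built-in sqlite3 module) and a tiny hand-made sample in place of the census file; the table layout, the function name unique_percentage and the sample rows are all illustrative assumptions:

```python
import sqlite3

def unique_percentage(rows):
    """Percentage of rows whose (zip, birth, gender) tuple occurs exactly once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (zip TEXT, birth TEXT, gender TEXT)")
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)
    total = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
    # Each group with COUNT(*) = 1 is exactly one uniquely identifiable person.
    unique = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT 1 FROM person GROUP BY zip, birth, gender HAVING COUNT(*) = 1"
        ")"
    ).fetchone()[0]
    conn.close()
    return 100.0 * unique / total if total else 0.0

# Hypothetical sample: two people share a tuple, two are unique.
sample = [
    ("10001", "1980-01-01", "F"),
    ("10001", "1980-01-01", "F"),
    ("10001", "1980-01-02", "F"),
    ("94105", "1975-06-30", "M"),
]
print(unique_percentage(sample))  # 50.0
```

For the real dataset the rows would come from the database's own bulk import rather than executemany, and an on-disk database would replace the in-memory one.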


Sorted file solution

You could also get your census file sorted by zip, birth and gender. Then you could read the data and compare each record to the previous one: if they are the same, increment a counter, and keep counting until one of these values changes.

Pseudocode:

lastrecord = {  };
counter = 1; 
while there's a record to read {
    read record 
    if (record.zip == lastrecord.zip 
          and record.birth == lastrecord.birth 
          and record.gender == lastrecord.gender) {
       counter = counter +1; 
    } 
    else {
         if (counter>1)  {    // output the count of duplicates
               write lastrecord.zip, lastrecord.birth, lastrecord.gender, counter
         }  
         counter =1; 
    }      
    lastrecord = record; 
}
if (counter>1)  {    // output the count of duplicates
     write lastrecord.zip, lastrecord.birth, lastrecord.gender, counter
}
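
The same scan can be sketched in Python with itertools.groupby, which groups consecutive equal records exactly as the loop above does. This is only an illustration: it assumes the records are already sorted (for a file of this size, presumably by an external tool such as the Unix sort utility) and are available as (zip, birth, gender) tuples; the function name scan_sorted and the sample data are made up.

```python
from itertools import groupby

def scan_sorted(records):
    """Scan already-sorted (zip, birth, gender) tuples in a single pass.

    Returns (total, unique, duplicates), where duplicates is a list of
    (tuple, count) pairs for every tuple that occurs more than once.
    """
    total = 0
    unique = 0
    duplicates = []
    for key, group in groupby(records):
        count = sum(1 for _ in group)  # length of the run of equal records
        total += count
        if count == 1:
            unique += 1
        else:
            duplicates.append((key, count))
    return total, unique, duplicates

# Hypothetical sample, sorted in memory only because it is tiny.
records = sorted([
    ("10001", "1980-01-01", "F"),
    ("94105", "1975-06-30", "M"),
    ("10001", "1980-01-01", "F"),
    ("10001", "1980-01-02", "F"),
])
total, unique, duplicates = scan_sorted(records)
print(100.0 * unique / total)  # 50.0
```

Only the current run of equal records is held in memory, so the scan itself needs almost no memory; the cost is in the sort.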

Associative map

A last option would be to read each record as it comes and store the three tuple values as a key in a map:

store 1 if the tuple was not yet loaded
increment existing tuple value if it already exists

In the end, iterate through the map and process the elements having a count greater than 1. OK, this one will cost you some memory ;-)
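
A minimal sketch of the map approach, using Python's collections.Counter as the associative map. The comma-separated line format and the function names count_tuples and percent_unique are assumptions made for illustration, not something the question specifies:

```python
from collections import Counter

def count_tuples(lines):
    """Count occurrences of each (zip, birth, gender) tuple.

    Assumes one record per line in the form 'zip,birth,gender'.
    """
    counts = Counter()
    for line in lines:
        zip_code, birth, gender = line.rstrip("\n").split(",")
        counts[(zip_code, birth, gender)] += 1
    return counts

def percent_unique(counts):
    """Percentage of records whose tuple was seen exactly once."""
    total = sum(counts.values())
    unique = sum(1 for c in counts.values() if c == 1)
    return 100.0 * unique / total if total else 0.0

# Hypothetical input; a real run would iterate over the open file instead.
lines = [
    "10001,1980-01-01,F\n",
    "94105,1975-06-30,M\n",
    "10001,1980-01-01,F\n",
    "10001,1980-01-02,F\n",
]
counts = count_tuples(lines)
print(percent_unique(counts))  # 50.0
```

The input does not need to be sorted, but the map holds one entry per distinct tuple, which is where the memory cost mentioned above comes from.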

Related Solutions

Question, the best data structure and algorithm

Why not use balanced trees? You would need one tree at each store, mapping item ID to some internal information, and one global tree, mapping item ID to the list of stores holding the given item. Each operation then costs O(log n + log k), where k is the number of distinct items in the global tree and n is the number of stores holding a given item.

Edit:
the operations in this case would look like the following: (I use the fact that std::map and std::set are tree-based in C++)

#include <map>
#include <set>

using namespace std;

typedef int itemid_t;
typedef int storeid_t;

// global index: item ID -> set of stores holding that item
map<itemid_t, set<storeid_t>> itemID2storeList;

class Store
{
    storeid_t Id;

public:
    explicit Store(storeid_t id) : Id(id) {}

    // insert requires one lookup in the map (O(log k))
    // plus one insert into a set (O(log n)).
    void insert(itemid_t item)
    {
        itemID2storeList[item].insert(Id);
    }

    // the same as for insert
    // (named remove because delete is a reserved word in C++)
    void remove(itemid_t item)
    {
        itemID2storeList[item].erase(Id);
    }

    // search in a store requires one lookup in a map (O(log k))
    // and one search in a set (O(log n)).
    bool have(itemid_t item)
    {
        const set<storeid_t>& storesForItem = itemID2storeList[item];
        return storesForItem.find(Id) != storesForItem.end();
    }
};

class Central
{
public:
    // getting the list of stores having the given item requires
    // one lookup in a map (O(log k))
    const set<storeid_t>& storesForItem(itemid_t item)
    {
        return itemID2storeList[item];
    }
};
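
For comparison, the same two-level index (item ID to the set of store IDs holding it) can be sketched in Python. Note that this swaps the technique: Python's dict and set are hash-based rather than balanced trees, so the operations are expected O(1) instead of O(log n + log k), and iteration order is not sorted. The class name Inventory and its method names are illustrative assumptions.

```python
from collections import defaultdict

class Inventory:
    """Global index mapping item ID -> set of store IDs that hold the item."""

    def __init__(self):
        self._stores_by_item = defaultdict(set)

    def insert(self, store_id, item_id):
        self._stores_by_item[item_id].add(store_id)

    def remove(self, store_id, item_id):
        stores = self._stores_by_item.get(item_id)
        if stores is not None:
            stores.discard(store_id)
            if not stores:  # drop empty entries so the index does not grow
                del self._stores_by_item[item_id]

    def has(self, store_id, item_id):
        return store_id in self._stores_by_item.get(item_id, ())

    def stores_for_item(self, item_id):
        # Return a read-only copy so callers cannot mutate the index.
        return frozenset(self._stores_by_item.get(item_id, ()))

inv = Inventory()
inv.insert(store_id=1, item_id=42)
inv.insert(store_id=2, item_id=42)
print(sorted(inv.stores_for_item(42)))  # [1, 2]
```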

Algorithm Execution – How and Where to Run Algorithms on Large Datasets?

The nice thing about the PageRank algorithm is that it can be solved iteratively in a distributed way, within the MapReduce framework. However, the working data for PageRank on ~5M nodes and ~50M edges should fit perfectly well in 4 GB of RAM, never mind 48 GB.

Specifically, you don't need to store all data for each web page in memory -- instead, you should digest your input database so that the working data for the PageRank solver refers to nodes by index. Even with no particular effort at optimization, each node should take no more than 32 bytes, and each edge no more than 16 bytes, for in-memory space of less than 1GB.
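
The estimate is easy to check. A quick sketch of the arithmetic, using the approximate node and edge counts and the per-element upper bounds given above (the constant and function names are only for illustration):

```python
NODES = 5_000_000        # ~5M nodes
EDGES = 50_000_000       # ~50M edges
BYTES_PER_NODE = 32      # upper bound per node
BYTES_PER_EDGE = 16      # upper bound per edge

def working_set_bytes(nodes=NODES, edges=EDGES):
    """Upper bound on the in-memory size of the index-based graph."""
    return nodes * BYTES_PER_NODE + edges * BYTES_PER_EDGE

total = working_set_bytes()
print(f"{total / 10**9:.2f} GB")  # 0.96 GB
```

That is 160 MB for the nodes plus 800 MB for the edges, so the edge arrays dominate.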

A demonstration example of this kind of data structure, in C++/STL:

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<float> old_rank, new_rank;  // rankings for each node
std::vector<int> end_edge;  // index after final edge for each node
std::vector<int> edge_dest;  // destination node index for each edge
std::vector<float> edge_weight;  // fractional weight for each edge

...

void pagerank_iteration(float base_value, float scale_value) {
    // std::vector has no fill() member, so reset with std::fill
    std::fill(new_rank.begin(), new_rank.end(), 0.0f);
    int edge = 0;  // loop variable:  current edge index
    for(std::size_t node=0; node<end_edge.size(); ++node) {  // loop over nodes
        while(edge < end_edge[node]) { // loop over edges of current node
            int dest_node = edge_dest[edge];
            new_rank[dest_node] += edge_weight[edge] * old_rank[node];
            ++edge;
        }
    }
    assert(static_cast<std::size_t>(edge) == edge_dest.size());

    for(std::size_t node=0; node<new_rank.size(); ++node) {  // add scale/offset
        new_rank[node] = base_value + scale_value * new_rank[node];
    }
}
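
To make the array layout concrete, here is a small Python sketch of the same iteration on a hand-built three-node graph. It is an illustration under stated assumptions, not a tuned implementation: the arrays are passed as arguments instead of being globals, and base_value and scale_value are simply set to 0 and 1 so that the result is easy to verify by hand.

```python
def pagerank_iteration(old_rank, end_edge, edge_dest, edge_weight,
                       base_value, scale_value):
    """One iteration over a graph stored as flat, index-based arrays.

    end_edge[i] is the index one past the last outgoing edge of node i,
    so the edges of node i are the slice end_edge[i-1]:end_edge[i].
    """
    new_rank = [0.0] * len(old_rank)
    edge = 0
    for node in range(len(end_edge)):          # loop over nodes
        while edge < end_edge[node]:           # loop over edges of this node
            new_rank[edge_dest[edge]] += edge_weight[edge] * old_rank[node]
            edge += 1
    assert edge == len(edge_dest)
    # add scale/offset
    return [base_value + scale_value * r for r in new_rank]

# Hypothetical graph: 0 -> 1, 0 -> 2 (weight 0.5 each), 1 -> 0, 2 -> 0 (weight 1).
end_edge = [2, 3, 4]
edge_dest = [1, 2, 0, 0]
edge_weight = [0.5, 0.5, 1.0, 1.0]

ranks = pagerank_iteration([1.0, 0.0, 0.0], end_edge, edge_dest, edge_weight,
                           base_value=0.0, scale_value=1.0)
print(ranks)  # [0.0, 0.5, 0.5]
```

All the rank starts on node 0 and is split evenly between its two successors, which matches the edge weights.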

Running on a single PC is easier than running it on a cloud service, because you don't need to use a network-capable framework (although it might be a good idea to keep that possibility in mind). The relatively small sizes you are describing can be solved easily with an ad-hoc single-threaded algorithm, and you can either use Hadoop locally or roll your own MapReduce using threads or inter-process communication if you want more cores.
