I recently heard the statistic "87% of the US population can be uniquely identified by the tuple of their zip code, birth date and gender". This is apparently not true, and I was wondering how I would verify it if I had the census data.
So, imagining I had a 300-million-line unsorted text file containing the gender, zip code and birth date of each person living in the US, what would be the quickest way of knowing what percentage of the population is uniquely identifiable by that tuple?
This should be a matter of identifying what percentage of the entries are duplicated in the dataset, but what would be a good way to go about it? I'm interested in useful algorithms and efficient data structures, and speed is more important than memory consumption as long as the latter is kept to a reasonable level.
Best Answer
SQL solution
You could load all the demographic data into an SQL database.
Unfortunately the file-import statement is not standard SQL; each RDBMS has its own (e.g. `BULK INSERT` for SQL Server, `LOAD DATA INFILE` for MySQL, or the SQL*Loader utility for Oracle).

The easiest and most efficient way would then be to use aggregate functions with a `GROUP BY` clause to count the number of persons sharing the same values for the grouping columns, keeping only the tuples with duplicates by means of a `HAVING` clause.
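A minimal sketch of that query, here run through Python's built-in `sqlite3` on a toy in-memory table (the column names `zip`, `birthdate`, `gender` are assumptions; adjust to your schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (zip TEXT, birthdate TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?)",
    [
        ("02139", "1980-01-01", "F"),
        ("02139", "1980-01-01", "F"),  # duplicate tuple: not uniquely identifiable
        ("10001", "1975-06-15", "M"),
    ],
)

# Count people sharing each (zip, birthdate, gender) tuple, keep only duplicates.
dup_rows = conn.execute(
    """SELECT zip, birthdate, gender, COUNT(*) AS n
       FROM census
       GROUP BY zip, birthdate, gender
       HAVING COUNT(*) > 1"""
).fetchall()

total = conn.execute("SELECT COUNT(*) FROM census").fetchone()[0]
duplicated = sum(row[3] for row in dup_rows)          # people in non-unique tuples
unique_pct = 100.0 * (total - duplicated) / total

print(dup_rows)    # [('02139', '1980-01-01', 'F', 2)]
print(unique_pct)
```

The same `GROUP BY ... HAVING COUNT(*) > 1` shape works unchanged on SQL Server, MySQL or Oracle once the data is loaded.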
Sorted file solution
You could also get your census file sorted by zip code, birth date and gender. Then read the records one by one, compare each record to the previous one and, while they are identical, keep counting until the values change.
Pseudocode:
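A minimal sketch of this single-pass scan, assuming one comma-separated `zip,birthdate,gender` record per line of the already-sorted file (an in-memory `StringIO` stands in for the real file here):

```python
import io

# Toy stand-in for the sorted census file.
sorted_file = io.StringIO(
    "02139,1980-01-01,F\n"
    "02139,1980-01-01,F\n"
    "10001,1975-06-15,M\n"
)

total = 0
duplicated = 0   # people whose tuple appears more than once
prev = None
run = 0          # length of the current run of identical records

def run_duplicates(run_len):
    # Every person in a run longer than 1 is non-unique.
    return run_len if run_len > 1 else 0

for line in sorted_file:
    record = line.rstrip("\n")
    total += 1
    if record == prev:
        run += 1
    else:
        duplicated += run_duplicates(run)
        prev = record
        run = 1
duplicated += run_duplicates(run)  # don't forget the final run

unique_pct = 100.0 * (total - duplicated) / total
```

This uses O(1) memory beyond the sort itself, which is why an external sort plus a sequential scan is attractive at 300 million lines.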
Associative map solution
A last way here would be to read each record as it comes, and use the 3-tuple of values as a key in a map, incrementing a counter per key.
In the end, iterate through the map and process the entries having a count greater than 1. OK, this one will cost you some memory ;-)
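A minimal sketch of the map-based approach using `collections.Counter` (the toy records below stand in for streaming the 300-million-line file):

```python
from collections import Counter

# Toy records; in reality you would stream these from the census file.
records = [
    ("02139", "1980-01-01", "F"),
    ("02139", "1980-01-01", "F"),
    ("10001", "1975-06-15", "M"),
]

# One hash-map entry per distinct (zip, birthdate, gender) tuple.
counts = Counter(records)

duplicated = sum(n for n in counts.values() if n > 1)  # people in shared tuples
total = len(records)
unique_pct = 100.0 * (total - duplicated) / total
print(f"{unique_pct:.1f}% uniquely identifiable")  # 33.3% uniquely identifiable
```

Memory grows with the number of *distinct* tuples rather than the number of records, which is what makes this workable despite the dataset size: roughly 42,000 zip codes × ~36,500 plausible birth dates × 2 genders bounds the key space, not the 300 million rows.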