Algorithms – Techniques for Faster Performance Comparison

algorithms · comparison · performance

I need to cross-reference names from two lists and find all occurrences of each name from one list in the other. The lists are big: one has 50k elements and the other 400k.

For small lists I would use two nested foreach loops or LINQ, but I can't have the program running for days.

What is your advice for performing these comparisons quickly?

EDIT: In some cases I need to find more than one occurrence in the second list; in other words, names are repeated and each occurrence has different information associated with it. The intention is then to merge the information from the two sources.

Best Answer

My suggestion would be to slurp the large list into a hash set, then use that to match items from the small list.

A hash set stores elements in an indexable memory structure, such as an array, where an element's position is determined by a hash value calculated from the object. That makes looking up a value relatively fast: calculate the hash of the object you're looking for, go to that index, and check the actual objects stored there, which in a good implementation will be a very small number. (Hash set implementations have to strike a balance between the size of the hash, and therefore the number of first-dimension slots, and the number of collisions, and therefore the average number of items in each slot.)
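Here is a minimal sketch of that approach in Java (the answer mentions Java's HashMap as a built-in implementation of this structure). It indexes the large list by name, mapping each name to a list of entries so the repeated names mentioned in the edit are all kept; the `Entry` record and the list contents are hypothetical stand-ins for whatever data the real lists carry.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NameMatcher {

    // Hypothetical record type: a name plus whatever information travels with it.
    record Entry(String name, String info) {}

    public static void main(String[] args) {
        List<Entry> largeList = new ArrayList<>(); // ~400k entries in practice
        List<Entry> smallList = new ArrayList<>(); // ~50k entries in practice
        largeList.add(new Entry("Alice", "source B, record 1"));
        largeList.add(new Entry("Alice", "source B, record 2"));
        smallList.add(new Entry("Alice", "source A, record 1"));

        // Index the large list once: name -> all entries with that name.
        // A list per key handles the "more than one occurrence" case.
        Map<String, List<Entry>> index = new HashMap<>();
        for (Entry e : largeList) {
            index.computeIfAbsent(e.name(), k -> new ArrayList<>()).add(e);
        }

        // One hash lookup per small-list entry: roughly 50k lookups
        // instead of 50k * 400k pairwise comparisons.
        for (Entry e : smallList) {
            for (Entry match : index.getOrDefault(e.name(), List.of())) {
                // Merge the information from the two sources here.
                System.out.println(e.name() + ": " + e.info() + " + " + match.info());
            }
        }
    }
}
```

A `Map<String, List<Entry>>` rather than a plain `HashSet<String>` is used here because the goal is to merge the associated information, not just to test membership.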

Ideally, hash sets approach constant lookup time. More precisely, lookup is O(log_{2^H} N), where H is the bit size of the hash function, so for all N < 2^H it is effectively constant. Overall, your matching algorithm would therefore approach linear complexity. There are two major downsides. First, unless you have access to a built-in, efficient implementation (Java's HashMap is built on this structure, as is .NET's Dictionary class), you have to roll your own, which is quite a bit of code. Second, hash sets are real memory hogs: there are virtually guaranteed to be a lot of empty slots in the array unless the implementation varies its hash function based on expected or actual capacity, which, if done naively, can involve re-hashing every element several times as the first dimension is extended to limit growth in the second.
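If that re-hashing churn is a concern and the element count is known up front, Java's HashMap can be pre-sized so it never resizes while being populated; a small sketch under that assumption (0.75 is the documented default load factor):

```java
import java.util.HashMap;
import java.util.Map;

public class PreSizedIndex {
    public static void main(String[] args) {
        // HashMap resizes (re-hashing every element) once size exceeds
        // capacity * loadFactor (0.75 by default). Requesting an initial
        // capacity of expected / 0.75 avoids any resize for the known load.
        int expected = 400_000;
        Map<String, Integer> index = new HashMap<>((int) Math.ceil(expected / 0.75));
        index.put("example", 1); // populate as in the sketch above
        System.out.println(index.size());
    }
}
```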
