C++ Algorithms – Fastest Way to Find All Numbers with Specific Digits

algorithmscpython

I have a huge set of over a million numbers of variable lengths.

['773', '2267', '8957251', '170597519', '373590109', '982451707', '999999937', ......]

Now given a bunch of digits, say 3 and 7, I intend to create a subset of all numbers which contain those digits. That is:

['773', '373590109', '999999937', ......]

This kind of lookup happens multiple times for different digits. Thus iterating over the entire list each time is not an option.

I am considering creating 10 subsets, one for each digit, on program startup. Each set will contain all numbers containing that digit. I then plan to use set intersection on each lookup. What do think of this method? Is there a better way which would achieve faster results?

I am implementing this in C++. But I am open to do the same in Python if there is an easier way available.

Best Answer

Lets take a step back for a moment and try to understand why you want to do this. You want to be able to perform a lookup for the existence of a value in a list of millions of numbers? Do you need to do this many times or just once?

The reason I ask is because if it is just once, the most efficient way is simply to load the numbers into memory perform a single pass for the existence of that number in the array. Threads would make this even more efficient, breaking the work into smaller pieces. If the array were too large for this, you could load it one chunk at a time, though I assume if you're potentially duplicating numbers in 10 different arrays, one per digit, memory is not an issue here.

If you need to perform repeated lookups, may I recommend using a trie. You create 9 bins containing the first digit to search for. That child then has 10 bins containing the the second digit to search for. Repeat as often as you wish. With respect to the 10 arrays approach that you mentioned, this would most certainly occupy less memory (the non-leaf nodes in the tree represent multiple numbers, not just one).

The reason why your approach might be potentially inefficient is because the chance of a number containing a digit increase significantly with each additional digit. With a 10 digit number, the chance of a particular digit not showing up is roughly (9/10)^10 or 34%. That means there's a 66% chance that a 10 digit number will show up in a given set, which roughly means we're talking about a 66% duplication of data. Smaller numbers might be more favorable, but even a 3 digit number has a 27% duplication rate.

Even if memory were not an issue, the replication of numbers is making your lookup algorithm less efficient.

Therefore my recommendation is to use a Trie for lookups. Though to answer your question, to get a list of all numbers containing a specific digit, your approach is best. Though unless I can't see the tree from the forest, this approach is probably not what you want to achieve.