Algorithms – Fastest Way to Determine if a Value is in a List

algorithmsdata structuresefficiencylanguage-agnostic

I'm aware there may not be a good solution to this problem. I'm looking for a fast algorithm that determines whether or not a value (let's say an integer) exists in a set of values, however I need the algorithm to also be fast if the value is not in the set. I.e.: Most methods I can think of right now that determine if a value is in a list will take longer to determine that it is not as they will not exit early.

I've been experimenting with using the keys of an unordered set but found them to actually be rather slow to verify. I'm not really sure what's going on behind the scenes with this data type. I'm aware that a straight forward array search might be faster than looking up a key but the size of the array is not properly known yet, only that it will be very large.

Properties and assumptions:

  • Assume all the values are integers between 0 and N, where N is likely to be in the millions.
  • The list will need to be checked very quickly, several 100 times per millisecond, but usually roughly in the same place each time with only a gradual deviation in the target area.
  • There is no constraint on the preparation time available for setting up a data structure, i.e.: the list could be organised specially somehow before hand.
  • There will be no duplicate values in the data.

Best Answer

If there is a known maximum N, you can use a Bit array for really fast lookup time.

Simply keep an array of size N/8 (rounded up) around, with each bit corresponding to a number, 1 if it is in the set, 0 if it isn't. Lookup the relevant bit to check whether a number is in the set.

If this is too slow, and you have the megabytes ("millions" doesn't sound very high), then just keep a boolean array of size N.

Edit: for much larger values of N this won't fit into memory, and you probably need some sort of hashing. It'd still be nice to use the fact that subsequent calls are likely to be close to each other, though.

So a sketch of an idea: use the number divided by 256 as the key, and store a 256-bit bit array as the value (for all the numbers that divide to this key). Store the most recently retrieved key and bit array. If the next call divides to the same key, you can look it up in the cache bit array immediately. The number 256 can be tweaked a lot, of course. I'm not sure this is actually faster than just using map without extra gimmicks, so profile it.

For large values of N it's going to depend on lot on the sparseness of the data; clearly if you have max N 10^12 and a lot of the values are in the set, then it's not going to fit into memory anyway.

Related Topic