Algorithms – Bloom Filters or Similar Without False Positives

algorithmsdata structuresfiltering

To improve some lookups, I am considering the use of Bloom Filters. But in my use-case, the most probable outcome is that the element do exists in the target set.

Bloom filters can have false positive, but no false negatives. That would make me check the real (big and slow) storage most of the time because of the uncertainties.

Is there another algorithm/data structure with the same properties for space and computation speed (and parallelism of query) that can assure no false positive and a low probability of false negative?

(The max size of the set will be around 300k items, the items will be strings of, at most, 512 characters, and I will have hundreds of sets like that.)

Best Answer

The nice thing about bloom filters is that their space requirement can be scaled arbitrarily, with the cost of more false positives as the size is decreased.

If you want no false positives whatsoever, you can't take probabilistic shortcuts and will not be able to reduce space requirements significantly (which isn't that much of an issue as storing each of your strings sequentially would only use up a few hundred MB per set).

There two important representations of string sets:

  • hash tables, and
  • tries.

These data structures allow parallel access, and each access is O(1) with respect to the number of items in the set.

Tries have the advantage that common prefixes of strings are shared, thus potentially reducing space requirements below that of sequential storage. A read-only trie can also be compressed further.

Related Topic