Efficiently find binary strings with low Hamming distance in large set

algorithmbit-manipulationbitwise-operatorshamming-distance

Problem:

Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.

Actual data structure to hold the list is open, performance requirements dictate an in-memory solution, cost to build the data structure is secondary, low cost to query the data structure is critical.

Example:

For a maximum Hamming Distance of 1 (values typically will be quite small)

And input: 
00001000100000000000000001111101

The values:
01001000100000000000000001111101 
00001000100000000010000001111101 

should match because there is only 1 position in which the bits are different.

11001000100000000010000001111101

should not match because 3 bit positions are different.

My thoughts so far:

For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.

If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.

How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1.

Best Answer

Question: What do we know about the Hamming distance d(x,y)?

Answer:

It is non-negative: d(x,y) ≥ 0
It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
It is symmetric: d(x,y) = d(y,x)
It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)

Question: Why do we care?

Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.

Metric tree (Wikipedia)
BK-tree (Wikipedia)
M-tree (Wikipedia)
VP-tree (Wikipedia)
Cover tree (Wikipedia)

You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.

Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:

static inline int distance(unsigned x, unsigned y)
{
    return __builtin_popcount(x^y);
}

If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.

Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperform's GCC's __builtin_popcount by about 160%.

Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.

The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.

Database: 100M pseudorandom integers
Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
Results: Average # of query hits (very approximate)
Speed: Number of queries per second
Coverage: Average percentage of database examined per query

                -- BK Tree --   -- VP Tree --   -- Linear --
Dist    Results Speed   Cov     Speed   Cov     Speed   Cov
1          0.90 3800     0.048% 4200     0.048%
2         11     300     0.68%   330     0.65%
3        130      56     3.8%     63     3.4%
4        970      18    12%       22    10%
5       5700       8.5  26%       10    22%
6       2.6e4      5.2  42%        6.0  37%
7       1.1e5      3.7  60%        4.1  54%
8       3.5e5      3.0  74%        3.2  70%
9       1.0e6      2.6  85%        2.7  82%
10      2.5e6      2.3  91%        2.4  90%
any                                             2.2     100%

In your comment, you mentioned:

I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.

I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.

A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.

Related Solutions

Mysql – Hamming distance on binary strings in SQL

It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only fast way to get decent performance is to split the content of the BINARY column in multiple BIGINT columns, each containing an 8-byte substring of the original data.

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

CREATE FUNCTION HAMMINGDISTANCE(
  A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, 
  B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(A0 ^ B0) +
  BIT_COUNT(A1 ^ B1) +
  BIT_COUNT(A2 ^ B2) +
  BIT_COUNT(A3 ^ B3);

Using this approach, in my testing, is over 100 times faster than using the BINARY approach.

FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );

Python – Representing and solving a maze given an image

Here is a solution.

Convert image to grayscale (not yet binary), adjusting weights for the colors so that final grayscale image is approximately uniform. You can do it simply by controlling sliders in Photoshop in Image -> Adjustments -> Black & White.
Convert image to binary by setting appropriate threshold in Photoshop in Image -> Adjustments -> Threshold.
Make sure threshold is selected right. Use the Magic Wand Tool with 0 tolerance, point sample, contiguous, no anti-aliasing. Check that edges at which selection breaks are not false edges introduced by wrong threshold. In fact, all interior points of this maze are accessible from the start.
Add artificial borders on the maze to make sure virtual traveler will not walk around it :)
Implement breadth-first search (BFS) in your favorite language and run it from the start. I prefer MATLAB for this task. As @Thomas already mentioned, there is no need to mess with regular representation of graphs. You can work with binarized image directly.

Here is the MATLAB code for BFS:

function path = solve_maze(img_file)
  %% Init data
  img = imread(img_file);
  img = rgb2gray(img);
  maze = img > 0;
  start = [985 398];
  finish = [26 399];

  %% Init BFS
  n = numel(maze);
  Q = zeros(n, 2);
  M = zeros([size(maze) 2]);
  front = 0;
  back = 1;

  function push(p, d)
    q = p + d;
    if maze(q(1), q(2)) && M(q(1), q(2), 1) == 0
      front = front + 1;
      Q(front, :) = q;
      M(q(1), q(2), :) = reshape(p, [1 1 2]);
    end
  end

  push(start, [0 0]);

  d = [0 1; 0 -1; 1 0; -1 0];

  %% Run BFS
  while back <= front
    p = Q(back, :);
    back = back + 1;
    for i = 1:4
      push(p, d(i, :));
    end
  end

  %% Extracting path
  path = finish;
  while true
    q = path(end, :);
    p = reshape(M(q(1), q(2), :), 1, 2);
    path(end + 1, :) = p;
    if isequal(p, start) 
      break;
    end
  end
end

It is really very simple and standard, there should not be difficulties on implementing this in Python or whatever.

And here is the answer:

Best Answer

Related Solutions

Mysql – Hamming distance on binary strings in SQL

Python – Representing and solving a maze given an image

Related Topic