Algorithms – Search for Similar Vectors Based on Elementwise Difference

algorithms, pattern matching

I understand that there is a thread discussing a similar issue here:

How to efficiently search a set of vectors for a vector which is the closest match

But my problem is slightly different – and hopefully easier.

Given a set of vectors of the same dimension, e.g.,

[2, 3, 5, 9]

[12, 9, 2, 8]

[45, 1, 0, 1]

And a query vector [1, 4, 7, 2], I would like an algorithm that returns [2, 3, 5, 9] as the closest matching vector because it has the smallest sum of elementwise differences in absolute value, i.e., (1-2) + (4-3) + (7-5) + (2-9) = -5

One way would be to pre-compute the pairwise 'similarity', but that computation is very heavy: with more than 5000 vectors, it means roughly 12.5 million pairwise comparisons.

Best Answer

You can rearrange your:

(1-2) + (4-3) + (7-5) + (2-9) 

...to become:

(1 + 4 + 7 + 2) - (2 + 3 + 5 + 9) = 14 - 19 = -5

...and (modulo roundoff errors, overflows, etc.) we're assured that the answer remains the same.

In other words, we can store the sum of the elements of each vector; when we get a query, we sum the elements of the query vector and search for the closest value among the stored sums.


This lets us do most of the calculation ahead of time, at a much more reasonable cost.
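
For concreteness, here's a minimal sketch of that idea (the function names are my own, not anything from the original answer): the sums are computed once up front, and each query is then a scan over 5000 numbers rather than 5000 full vector comparisons.

```cpp
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <numeric>
#include <vector>

// Done once, up front: reduce each stored vector to a single number.
std::vector<long> precompute_sums(std::vector<std::vector<int>> const &vectors) {
    std::vector<long> sums;
    sums.reserve(vectors.size());
    for (auto const &v : vectors)
        sums.push_back(std::accumulate(v.begin(), v.end(), 0L));
    return sums;
}

// Done per query: linear scan over the sums (still O(N), but over N numbers, not N vectors).
std::size_t closest_by_sum(std::vector<long> const &sums, std::vector<int> const &query) {
    long const query_sum = std::accumulate(query.begin(), query.end(), 0L);
    std::size_t best = 0;
    long best_diff = std::numeric_limits<long>::max();
    for (std::size_t i = 0; i < sums.size(); ++i) {
        long diff = std::abs(query_sum - sums[i]);
        if (diff < best_diff) {
            best_diff = diff;
            best = i;
        }
    }
    return best;
}

int main() {
    std::vector<std::vector<int>> vectors{{2, 3, 5, 9}, {12, 9, 2, 8}, {45, 1, 0, 1}};
    auto sums = precompute_sums(vectors);                    // sums: 19, 31, 47
    std::cout << closest_by_sum(sums, {1, 4, 7, 2}) << '\n'; // prints 0, i.e. [2, 3, 5, 9]
}
```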

We can still do better than just searching through all those sums, though. For example, we could create an std::multimap (or std::unordered_multimap) to tell us the vector(s) associated with each sum. With an unordered_multimap, we can expect approximately constant complexity for the lookup.
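
As a rough sketch of that variant (assuming we key the std::unordered_multimap by the precomputed sum and store each vector's index as the value), an exact sum can be looked up in expected constant time. Note that this hashed form only finds exact sum matches; when the closest sum may not be an exact hit, the ordered multimap described next is the better fit.

```cpp
#include <cstddef>
#include <iostream>
#include <numeric>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<std::vector<int>> vectors{{2, 3, 5, 9}, {12, 9, 2, 8}, {45, 1, 0, 1}};

    // Built once: sum -> index of every vector with that sum.
    std::unordered_multimap<long, std::size_t> by_sum;
    for (std::size_t i = 0; i < vectors.size(); ++i)
        by_sum.emplace(std::accumulate(vectors[i].begin(), vectors[i].end(), 0L), i);

    // All vectors whose elements add up to exactly 19 (here: just index 0, [2, 3, 5, 9]).
    auto [first, last] = by_sum.equal_range(19);
    for (auto it = first; it != last; ++it)
        std::cout << it->second << '\n';
}
```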

In the comments, however, you've added a requirement for finding the N most similar vectors. In this case, you probably want to use a std::multimap instead of the unordered variety. With this, the initial lookup will be logarithmic instead of (expected) constant complexity--but finding the next most similar item can be done with constant complexity, because they're ordered.
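
A sketch of how that might look, assuming the multimap maps each precomputed sum to the index of its vector (the helper name is mine): lower_bound gives the logarithmic initial lookup, and the N most similar vectors are collected by walking outward from that position in both directions.

```cpp
#include <cstddef>
#include <iostream>
#include <iterator>
#include <limits>
#include <map>
#include <numeric>
#include <vector>

// Walk outward from the position lower_bound() gives us, taking whichever
// neighbor (lower or higher sum) is currently closer to the query's sum.
std::vector<std::size_t> n_most_similar(std::multimap<long, std::size_t> const &by_sum,
                                        long query_sum, std::size_t n) {
    std::vector<std::size_t> result;
    auto right = by_sum.lower_bound(query_sum);  // first entry with sum >= query_sum
    auto left = right;                           // entries with smaller sums lie before this

    while (result.size() < n && (left != by_sum.begin() || right != by_sum.end())) {
        long left_diff = (left != by_sum.begin())
                             ? query_sum - std::prev(left)->first
                             : std::numeric_limits<long>::max();
        long right_diff = (right != by_sum.end())
                              ? right->first - query_sum
                              : std::numeric_limits<long>::max();
        if (left_diff <= right_diff) {
            --left;
            result.push_back(left->second);
        } else {
            result.push_back(right->second);
            ++right;
        }
    }
    return result;
}

int main() {
    std::vector<std::vector<int>> vectors{{2, 3, 5, 9}, {12, 9, 2, 8}, {45, 1, 0, 1}};
    std::multimap<long, std::size_t> by_sum;
    for (std::size_t i = 0; i < vectors.size(); ++i)
        by_sum.emplace(std::accumulate(vectors[i].begin(), vectors[i].end(), 0L), i);

    for (std::size_t idx : n_most_similar(by_sum, 1 + 4 + 7 + 2, 2))
        std::cout << idx << ' ';   // prints "0 1": sums 19 and 31 are closest to 14
    std::cout << '\n';
}
```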

Depending on your situation, it might also be worth considering using a sorted vector of sums instead. This tends to make substantially better use of the cache, since it stores the data contiguously. The tradeoff is that it doesn't support insertion/deletion of elements as nicely. If you do a lot of searching and rarely (or never) change those 5000 vectors, then a sorted vector of sums will probably work best. If, however, they're subject to a lot of changes (i.e., you constantly delete old vectors and insert new ones) then a std::multimap may be a better choice.
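
Here's a corresponding sketch for the sorted-vector variant; the (sum, index) pair layout is my own choice, and it assumes at least one stored vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <numeric>
#include <utility>
#include <vector>

int main() {
    std::vector<std::vector<int>> vectors{{2, 3, 5, 9}, {12, 9, 2, 8}, {45, 1, 0, 1}};

    // Build once (or rebuild/re-sort on the rare occasions the set changes).
    std::vector<std::pair<long, std::size_t>> sums;
    for (std::size_t i = 0; i < vectors.size(); ++i)
        sums.emplace_back(std::accumulate(vectors[i].begin(), vectors[i].end(), 0L), i);
    std::sort(sums.begin(), sums.end());

    // Per query: binary search, then check whether the neighbor just below is closer.
    long query_sum = 1 + 4 + 7 + 2;  // 14
    auto it = std::lower_bound(sums.begin(), sums.end(),
                               std::make_pair(query_sum, std::size_t{0}));
    if (it != sums.begin() &&
        (it == sums.end() || query_sum - std::prev(it)->first < it->first - query_sum))
        --it;  // the element just below the query sum is actually closer

    std::cout << "closest: index " << it->second << " (sum " << it->first << ")\n";
}
```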


One more point: a multimap uses a binary search to find the closest item. If your sums are reasonably evenly distributed, you can use an interpolating search instead. For that, you start by finding the difference between the smallest and largest sum. For example, let's assume the smallest is 4 and the largest is 7912038, and we're looking for the sum closest to 32750. The difference between the extremes is 7912034, so we start by computing (32750 - 4) / 7912034 = about 0.004. Then (assuming exactly 5000 vectors) we multiply that by 5000 to get about 20. So, instead of starting at the middle of the array of sums, we can start at the 20th item in the array (and depending on how far wrong that result is, we can repeat the same process until we get to the desired point).

Assuming the sums are reasonably evenly distributed, this lets us find the most similar item with about O(log log N) complexity instead of the O(log N) complexity of a binary search. Of course, for the moment I've used linear interpolation--if we know that the sums have (for example) a logarithmic or exponential distribution, we can use a logarithmic or exponential interpolation. In other words, the sums don't really have to be evenly distributed for this to work well--they just have to be reasonably predictably distributed.
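
Below is a sketch of an interpolating search over a sorted array of sums, using the linear interpolation described above; the helper name, the toy data, and the exact stopping rule are my own choices rather than anything prescribed by the answer.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Returns the index of the sum closest to `target` in a sorted, non-empty array.
std::size_t closest_index(std::vector<long> const &sums, long target) {
    std::size_t lo = 0, hi = sums.size() - 1;
    // Narrow [lo, hi] with interpolated probes while target lies strictly inside the range.
    while (lo + 1 < hi && sums[lo] < target && target < sums[hi]) {
        // Probe proportionally to where target sits between sums[lo] and sums[hi].
        std::size_t probe =
            lo + static_cast<std::size_t>(
                     static_cast<double>(target - sums[lo]) * (hi - lo) /
                     (sums[hi] - sums[lo]));
        if (probe == lo) probe = lo + 1;   // force progress so the range always shrinks
        if (probe == hi) probe = hi - 1;
        if (sums[probe] < target)
            lo = probe;
        else
            hi = probe;
    }
    if (sums[lo] >= target) return lo;     // target at or below the low end of the bracket
    if (sums[hi] <= target) return hi;     // target at or above the high end of the bracket
    return (target - sums[lo] <= sums[hi] - target) ? lo : hi;
}

int main() {
    // A toy set of sums; the extremes and the query mirror the example above
    // (smallest 4, largest 7912038, looking for the sum closest to 32750).
    std::vector<long> sums{4, 120, 9000, 32000, 33000, 500000, 7912038};
    std::cout << closest_index(sums, 32750) << '\n';  // prints 4 (sum 33000)
}
```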

log log N is often called "pseudo-constant"--it does grow with the number of items being examined, but grows so slowly that for most practical purposes it can be treated as constant. For your 5000 items, log2(log2(5000)) is only about 3.6. It doesn't reach 7 until you're examining more items than there are bytes in all the storage produced on planet Earth in its entire history, and it doesn't reach 9 until N exceeds the number of atoms in the observable universe.
