Algorithms – Search for Similar Vectors Based on Elementwise Difference

algorithms, pattern matching

I understand that there is a thread discussing a similar issue here:

How to efficiently search a set of vectors for a vector which is the closest match

But my problem is slightly different – and hopefully easier.

Given a set of vectors of the same dimension, e.g.,

[2, 3, 5, 9]

[12, 9, 2, 8]

[45, 1, 0, 1]

And a query vector [1, 4, 7, 2], I would like an algorithm that returns [2, 3, 5, 9] as the closest matching vector because it has the smallest sum of elementwise differences in absolute value, i.e., (1-2) + (4-3) + (7-5) + (2-9) = -5

One way would be to pre-compute the pairwise 'similarity', but that computation is very heavy: with more than 5000 vectors, it means roughly 12.5 million pairwise comparisons.

Best Answer

You can rearrange your:

(1-2) + (4-3) + (7-5) + (2-9) 

...to become:

(1 + 4 + 7 + 2) - (2 + 3 + 5 + 9) = 14 - 19 = -5

...and (modulo roundoff errors, overflows, etc.) we're assured that the answer remains the same.

In other words, we can store the sum of the elements of each vector; when we get a query, we sum the elements of the query vector and search for the closest value among the stored sums.


This lets us do most of the calculation ahead of time, at a much more reasonable cost.
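
For concreteness, here's a minimal sketch of that idea (the function names are my own, not anything from the original answer): the sums are computed once up front, and each query is then a scan over 5000 numbers rather than 5000 full vector comparisons.

```cpp
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <numeric>
#include <vector>

// Done once, up front: reduce each stored vector to a single number.
std::vector<long> precompute_sums(std::vector<std::vector<int>> const &vectors) {
    std::vector<long> sums;
    sums.reserve(vectors.size());
    for (auto const &v : vectors)
        sums.push_back(std::accumulate(v.begin(), v.end(), 0L));
    return sums;
}

// Done per query: linear scan over the sums (still O(N), but over N numbers, not N vectors).
std::size_t closest_by_sum(std::vector<long> const &sums, std::vector<int> const &query) {
    long const query_sum = std::accumulate(query.begin(), query.end(), 0L);
    std::size_t best = 0;
    long best_diff = std::numeric_limits<long>::max();
    for (std::size_t i = 0; i < sums.size(); ++i) {
        long diff = std::abs(query_sum - sums[i]);
        if (diff < best_diff) {
            best_diff = diff;
            best = i;
        }
    }
    return best;
}

int main() {
    std::vector<std::vector<int>> vectors{{2, 3, 5, 9}, {12, 9, 2, 8}, {45, 1, 0, 1}};
    auto sums = precompute_sums(vectors);                    // sums: 19, 31, 47
    std::cout << closest_by_sum(sums, {1, 4, 7, 2}) << '\n'; // prints 0, i.e. [2, 3, 5, 9]
}
```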

We can still do better than just searching through all those sums, though. For example, we could create an std::multimap (or std::unordered_multimap) to tell us the vector(s) associated with each sum. With an unordered_multimap, we can expect approximately constant complexity for the lookup.
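
As a rough sketch of that variant (assuming we key the std::unordered_multimap by the precomputed sum and store each vector's index as the value), an exact sum can be looked up in expected constant time. Note that this hashed form only finds exact sum matches; when the closest sum may not be an exact hit, the ordered multimap described next is the better fit.

```cpp
#include <cstddef>
#include <iostream>
#include <numeric>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<std::vector<int>> vectors{{2, 3, 5, 9}, {12, 9, 2, 8}, {45, 1, 0, 1}};

    // Built once: sum -> index of every vector with that sum.
    std::unordered_multimap<long, std::size_t> by_sum;
    for (std::size_t i = 0; i < vectors.size(); ++i)
        by_sum.emplace(std::accumulate(vectors[i].begin(), vectors[i].end(), 0L), i);

    // All vectors whose elements add up to exactly 19 (here: just index 0, [2, 3, 5, 9]).
    auto [first, last] = by_sum.equal_range(19);
    for (auto it = first; it != last; ++it)
        std::cout << it->second << '\n';
}
```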

In the comments, however, you've added a requirement for finding the N most similar vectors. In this case, you probably want to use a std::multimap instead of the unordered variety. With this, the initial lookup will be logarithmic instead of (expected) constant complexity--but finding the next most similar item can be done with constant complexity, because they're ordered.
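
A sketch of how that might look, assuming the multimap maps each precomputed sum to the index of its vector (the helper name is mine): lower_bound gives the logarithmic initial lookup, and the N most similar vectors are collected by walking outward from that position in both directions.

```cpp
#include <cstddef>
#include <iostream>
#include <iterator>
#include <limits>
#include <map>
#include <numeric>
#include <vector>

// Walk outward from the position lower_bound() gives us, taking whichever
// neighbor (lower or higher sum) is currently closer to the query's sum.
std::vector<std::size_t> n_most_similar(std::multimap<long, std::size_t> const &by_sum,
                                        long query_sum, std::size_t n) {
    std::vector<std::size_t> result;
    auto right = by_sum.lower_bound(query_sum);  // first entry with sum >= query_sum
    auto left = right;                           // entries with smaller sums lie before this

    while (result.size() < n && (left != by_sum.begin() || right != by_sum.end())) {
        long left_diff = (left != by_sum.begin())
                             ? query_sum - std::prev(left)->first
                             : std::numeric_limits<long>::max();
        long right_diff = (right != by_sum.end())
                              ? right->first - query_sum
                              : std::numeric_limits<long>::max();
        if (left_diff <= right_diff) {
            --left;
            result.push_back(left->second);
        } else {
            result.push_back(right->second);
            ++right;
        }
    }
    return result;
}

int main() {
    std::vector<std::vector<int>> vectors{{2, 3, 5, 9}, {12, 9, 2, 8}, {45, 1, 0, 1}};
    std::multimap<long, std::size_t> by_sum;
    for (std::size_t i = 0; i < vectors.size(); ++i)
        by_sum.emplace(std::accumulate(vectors[i].begin(), vectors[i].end(), 0L), i);

    for (std::size_t idx : n_most_similar(by_sum, 1 + 4 + 7 + 2, 2))
        std::cout << idx << ' ';   // prints "0 1": sums 19 and 31 are closest to 14
    std::cout << '\n';
}
```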

Depending on your situation, it might also be worth considering using a sorted vector of sums instead. This tends to make substantially better use of the cache, since it stores the data contiguously. The tradeoff is that it doesn't support insertion/deletion of elements as nicely. If you do a lot of searching and rarely (or never) change those 5000 vectors, then a sorted vector of sums will probably work best. If, however, they're subject to a lot of changes (i.e., you constantly delete old vectors and insert new ones) then a std::multimap may be a better choice.
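
Here's a corresponding sketch for the sorted-vector variant; the (sum, index) pair layout is my own choice, and it assumes at least one stored vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <numeric>
#include <utility>
#include <vector>

int main() {
    std::vector<std::vector<int>> vectors{{2, 3, 5, 9}, {12, 9, 2, 8}, {45, 1, 0, 1}};

    // Build once (or rebuild/re-sort on the rare occasions the set changes).
    std::vector<std::pair<long, std::size_t>> sums;
    for (std::size_t i = 0; i < vectors.size(); ++i)
        sums.emplace_back(std::accumulate(vectors[i].begin(), vectors[i].end(), 0L), i);
    std::sort(sums.begin(), sums.end());

    // Per query: binary search, then check whether the neighbor just below is closer.
    long query_sum = 1 + 4 + 7 + 2;  // 14
    auto it = std::lower_bound(sums.begin(), sums.end(),
                               std::make_pair(query_sum, std::size_t{0}));
    if (it != sums.begin() &&
        (it == sums.end() || query_sum - std::prev(it)->first < it->first - query_sum))
        --it;  // the element just below the query sum is actually closer

    std::cout << "closest: index " << it->second << " (sum " << it->first << ")\n";
}
```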


One more point: a multimap uses a binary search to find the closest item. If your sums are reasonably evenly distributed, you can use an interpolating search instead. For that, you start by finding the difference between the smallest and largest sum. For example, let's assume the smallest is 4 and the largest is 7912038, and we're looking for the sum closest to 32750. The difference between the extremes is 7912034, so we start by computing (32750 - 4) / 7912034 = about 0.004. Then (assuming exactly 5000 vectors) we multiply that by 5000 to get about 20. So, instead of starting at the middle of the array of sums, we can start at the 20th item in the array (and depending on how far wrong that result is, we can repeat the same process until we get to the desired point).

Assuming the sums are reasonably evenly distributed, this lets us find the most similar item with about O(log log N) complexity instead of the O(log N) complexity of a binary search. Of course, for the moment I've used linear interpolation--if we know that the sums have (for example) a logarithmic or exponential distribution, we can use a logarithmic or exponential interpolation. In other words, the sums don't really have to be evenly distributed for this to work well--they just have to be reasonably predictably distributed.
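
Below is a sketch of an interpolating search over a sorted array of sums, using the linear interpolation described above; the helper name, the toy data, and the exact stopping rule are my own choices rather than anything prescribed by the answer.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Returns the index of the sum closest to `target` in a sorted, non-empty array.
std::size_t closest_index(std::vector<long> const &sums, long target) {
    std::size_t lo = 0, hi = sums.size() - 1;
    // Narrow [lo, hi] with interpolated probes while target lies strictly inside the range.
    while (lo + 1 < hi && sums[lo] < target && target < sums[hi]) {
        // Probe proportionally to where target sits between sums[lo] and sums[hi].
        std::size_t probe =
            lo + static_cast<std::size_t>(
                     static_cast<double>(target - sums[lo]) * (hi - lo) /
                     (sums[hi] - sums[lo]));
        if (probe == lo) probe = lo + 1;   // force progress so the range always shrinks
        if (probe == hi) probe = hi - 1;
        if (sums[probe] < target)
            lo = probe;
        else
            hi = probe;
    }
    if (sums[lo] >= target) return lo;     // target at or below the low end of the bracket
    if (sums[hi] <= target) return hi;     // target at or above the high end of the bracket
    return (target - sums[lo] <= sums[hi] - target) ? lo : hi;
}

int main() {
    // A toy set of sums; the extremes and the query mirror the example above
    // (smallest 4, largest 7912038, looking for the sum closest to 32750).
    std::vector<long> sums{4, 120, 9000, 32000, 33000, 500000, 7912038};
    std::cout << closest_index(sums, 32750) << '\n';  // prints 4 (sum 33000)
}
```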

log log N is often called "pseudo-constant"--it does grow with the number of items being examined, but grows so slowly that for most practical purposes it can be treated as constant. For your 5000 items, log2(log2(5000)) is only about 3.6. It doesn't reach 7 until you're examining more items than there are bytes in all the storage produced on planet Earth in its entire history, and it doesn't reach 9 until N exceeds the number of atoms in the observable universe.
