How to efficiently search a set of vectors for a vector which is the closest match

algorithms

Unfortunately, I am not even sure of the keywords to use for this question, so if this has already been asked please point me the right way.

Given a set of vectors:

[3, 5, 3]
[10, 23, 5]
[123, 53, 97]

(here, 'vector' means a one-dimensional list of numeric values).

I'd like to know a way of finding the closest match for any input vector. For example,

[2, 4, 5]

would return the first vector from my list as the closest match. I'd also like to know the distance of that match.

One way of visualising this is to plot each vector as a line graph; similar vectors produce similar-looking lines.

I don't want to do a linear search: I could have millions of vectors. I don't mind spending time on pre-processing; it's search time I want to optimise.

Defining 'closest match'

In this case, let's assume the data is numeric and the 'closest match' is a simple numeric comparison giving an absolute distance. For example, comparing [2, 4, 5] against the test data gives element-wise absolute differences of:

[1, 1, 2]
[8, 19, 0]
[121, 49, 92]
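
For concreteness, the naive search this comparison implies can be sketched as a linear scan, summing the element-wise absolute differences (the L1 distance). This is the approach the question wants to beat; all names are illustrative.

```python
# Baseline: linear scan with the sum of element-wise absolute
# differences (L1 distance). O(n) per query over n stored vectors.

def l1_distance(a, b):
    """Sum of absolute element-wise differences of two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_scan(query, vectors):
    """Return (closest_vector, distance) by visiting every stored vector."""
    return min(((v, l1_distance(query, v)) for v in vectors),
               key=lambda pair: pair[1])

data = [[3, 5, 3], [10, 23, 5], [123, 53, 97]]
best, dist = nearest_scan([2, 4, 5], data)
# best == [3, 5, 3]; dist == 1 + 1 + 2 == 4
```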

Varying size vectors

How would vectors of different sizes be handled? I'd like to be able to handle cases such as the following input:

[120, 99]

This would match the third vector from the test data, in some "lower scoring" sense, because the two values are similar to the first and last values of that vector, ignoring the middle value, and in the same order.

Another way of describing it might be: the ordering of the values in the vector is important, but not their exact positions.

Clustering solutions

I considered some form of clustering: hashing the vectors into buckets, then hashing the input and returning all vectors in the matching bucket. But what troubles me is this: if a given input is close to the "boundary" of a cluster, wouldn't nearby vectors in adjacent clusters be missed?
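
To make that boundary worry concrete, here is a sketch of one common mitigation: quantise each coordinate into a grid cell, and at query time probe all adjacent cells as well as the query's own cell. The cell width and names are assumptions, and probing 3^d cells is only practical for low dimensionality d.

```python
# Grid-bucket index with adjacent-cell probing to soften cluster
# boundaries. A query near the edge of its own cell still finds
# neighbours one cell over.
from collections import defaultdict
from itertools import product

CELL = 10  # grid cell width (illustrative; tune for your data)

def cell_of(vec):
    return tuple(v // CELL for v in vec)

def build_index(vectors):
    index = defaultdict(list)
    for v in vectors:
        index[cell_of(v)].append(v)
    return index

def candidates(index, query):
    """All stored vectors in the query's cell or any adjacent cell."""
    base = cell_of(query)
    found = []
    for offsets in product((-1, 0, 1), repeat=len(base)):
        key = tuple(b + o for b, o in zip(base, offsets))
        found.extend(index.get(key, []))
    return found

index = build_index([[3, 5, 3], [10, 23, 5], [123, 53, 97]])
# candidates(index, [9, 21, 4]) includes [10, 23, 5] even though the
# query falls in a different cell than that vector.
```

A linear scan over the returned candidates (rather than the whole store) then picks the closest. Note this only guarantees finding neighbours within one cell width; a query whose true nearest neighbour is further away still needs a fallback.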

Application domain

I want to use this algorithm to find similar musical releases by their constituent track lengths. As such, the lengths can be stored as an integer in some time unit (seconds or milliseconds, say) and the data ends up looking like the above.

So: similar to the CDDB (FreeDB) algorithm, but with more leniency, the ability to define and explore the distance measure, and support for different-length vectors.

Other questions I have looked at

How to find the closest vector to a given vector? seems to discuss 2D points; while this problem could be considered in a 2D space, the values in each vector must be considered together.

Best Answer

This is a concept map; since it doesn't commit to a particular application domain, it offers general directions rather than concrete suggestions.

In general, the steps to tackle this problem are:

  • First stage Find out what makes a good similarity or distance measure between two feature vectors of the same length.
    • There are many choices. Examples:
    • Norm, e.g. L1, L2, L-infinity
    • Cosine
    • Earth mover's distance (EMD)
    • Locality-sensitive hashing (LSH)
    • Random projection
    • (Strictly, the last two are approximation techniques built on top of an underlying measure, rather than measures themselves.)
    • While some choices may work better for specific applications, in general a toolbox that implements all of the well-known ones will be able to cover almost all applications. The chance of having to invent a new similarity measure for same-length vector comparison is low.
    • However, there are still some application-specific distance measures:
      • Transformation into frequency domain. This will be explained next.
      • In other words, there are certain types of feature vectors (those that could be called time-series or signals) that will benefit from a unified approach for same-length and different-length vector comparison.
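
The same-length measures listed under the first stage can be sketched in a few lines of plain Python; a real toolbox would more likely wrap numpy/scipy, and the function names here are illustrative.

```python
# Minimal "toolbox" of same-length similarity/distance measures.
import math

def l1(a, b):        # L1 (Manhattan) norm of the difference
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):        # L2 (Euclidean) norm of the difference
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def linf(a, b):      # L-infinity (Chebyshev) norm of the difference
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):  # 1 - cosine similarity of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# e.g. l1([2, 4, 5], [3, 5, 3]) == 4 and linf([2, 4, 5], [3, 5, 3]) == 2
```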
  • Second stage Find out how to measure similarity or distance between two feature vectors of different length.
    • This is highly application-specific.
    • In general, it falls into either signal processing or pattern recognition.
    • One kind of data common to both signal processing and pattern recognition is the time series. The axis doesn't have to correspond to time, though; it could be, for example, the x- or y-coordinate of a 2D image.
    • Time series can be transformed into frequency domain, after which comparison can be performed.
    • Time series can also be decomposed into scale space.
    • Time series that "warp" or don't align properly can be compared using dynamic time warping (DTW) which is based on dynamic programming. However, DTW can only be applied to a pair of vectors at once. Unlike the transformations discussed earlier, DTW cannot be used to extract a representation from a single vector.
    • A 2D image should be treated as a time series of time series (i.e. nested). Don't try to rasterize two images of different sizes into vectors of different lengths. Instead, resize the two images to the same 2D size, then convert each into a 1D vector.
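
The DTW mentioned above can be sketched with the classic O(n*m) dynamic-programming table; using abs(x - y) as the local cost is an illustrative choice. It directly handles the question's different-length example:

```python
# Dynamic time warping between two sequences of possibly different
# lengths. dp[i][j] holds the cost of the best alignment of a[:i]
# with b[:j].

def dtw(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # skip an element of a
                                  dp[i][j - 1],      # skip an element of b
                                  dp[i - 1][j - 1])  # match the pair
    return dp[n][m]

# The query [120, 99] warps onto [123, 53, 97] far more cheaply than
# onto the other two test vectors, matching the question's intuition:
# dtw([120, 99], [123, 53, 97]) == 51.0
```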
  • Third stage Find out how to speed up large scale searches, such as
    • Pre-process a large number of stored vectors so that queries that ask for the nearest vector will not need to visit every single vector in the store. (This goal is analogous to creating an index for a database table, but different techniques are needed.)
    • Finding clusters from a large number of stored vectors
    • This knowledge typically belongs to information retrieval, specifically multimedia information retrieval.
    • Extremely simple cases, such as 2D/3D points or bounding rectangles, can be searched using spatial queries (e.g. k-d trees or R-trees). However, these structures are ineffective for multimedia information retrieval, which may need to work with vectors of hundreds to thousands of dimensions.
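
As an illustration of that pre-processing for the simple low-dimensional case, here is a toy k-d tree. A real system would use a library (e.g. scipy.spatial.cKDTree, FAISS, or Annoy), and k-d trees degrade badly as dimensionality grows; this sketch only shows the shape of the idea.

```python
# Toy k-d tree: O(n log n)-ish build, then nearest-neighbour queries
# that prune whole subtrees instead of scanning every vector.

def build(points, depth=0):
    """Recursively build a node: (point, left, right, splitting_axis)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1),
            axis)

def kd_nearest(node, query, best=None):
    """Depth-first search with pruning; returns (point, squared_distance)."""
    if node is None:
        return best
    point, left, right, axis = node
    d = sum((p - q) ** 2 for p, q in zip(point, query))
    if best is None or d < best[1]:
        best = (point, d)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = kd_nearest(near, query, best)
    if diff ** 2 < best[1]:  # the far side could still hold a closer point
        best = kd_nearest(far, query, best)
    return best

tree = build([[3, 5, 3], [10, 23, 5], [123, 53, 97]])
point, sq_dist = kd_nearest(tree, [2, 4, 5])
# point == [3, 5, 3]; sq_dist == 1 + 1 + 4 == 6
```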