Finding Similar Objects in Large Data Sets – Algorithms and Techniques

algorithms · c# · database

I have a large collection of (C#) objects, and these objects have a large number of properties, mostly strings and numbers. This collection is stored in a database. When a new object is about to be added, I need to check whether similar objects already exist in the database.

One way of doing this would be to enumerate all objects and compare each property for similarity. For strings this could be done using something like the Levenshtein distance; for numerical properties I could take the value of the new object, define an acceptable range around it, and check whether existing values fall within that range. The problem with this approach is that it would be very slow.
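To make the brute-force idea concrete, here is a minimal sketch in Python (the same logic carries over to C#). The property names, the edit-distance threshold, and the relative tolerance are illustrative assumptions, not part of the question; it also assumes both objects carry the same properties.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, kept to two rows of memory.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def is_similar(new_obj: dict, existing: dict,
               max_edits: int = 2, tolerance: float = 0.1) -> bool:
    # Compare every property of the new object against the existing one:
    # strings by edit distance, numbers by a relative tolerance band.
    # Thresholds here are arbitrary placeholders.
    for key, value in new_obj.items():
        other = existing[key]
        if isinstance(value, str):
            if levenshtein(value, other) > max_edits:
                return False
        else:
            if abs(value - other) > tolerance * max(abs(value), abs(other), 1):
                return False
    return True
```

Running `is_similar` against every stored object is exactly the O(n) scan the question complains about; the answer below is about shrinking the set of objects this check has to touch.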

I would imagine that there are solutions to this problem. Something like a "flexible hash"? Can anybody point me in the right direction?

Best Answer

In general, this can be a pretty hard problem, especially when you want to implement it on top of a traditional database. There are special cases that are easy to solve; you have to check for yourself whether your requirements allow you to exploit such a case, or whether you need a fully general solution. It also makes a big difference whether you want to check string similarity (with many different options for defining what "similar" means), similarity of single numerical values, similarity in some 2D or 3D space, or similarity under a metric spanning multiple properties.

If possible, try to pick one property, then use some "fuzzy indexing" technique to find all objects in the collection that are "similar" with regard to that property, and hope this reduces the remaining objects to a number small enough that a brute-force search over the remaining properties runs fast enough. This works best when you can identify a property with a huge range of different values, ideally evenly distributed over its value space.
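The filter-then-verify pattern might look like the following sketch, here using a sorted list with binary search as the "fuzzy index" over one numeric property. The property name `"price"` and the search radius are hypothetical stand-ins for whichever property in your data has the widest spread of values.

```python
import bisect


class NumericIndex:
    """Index over one numeric property, answering 'all objects within radius'."""

    def __init__(self, objects: list, prop: str):
        self.objects = objects
        # Pairs of (property value, object position), sorted by value.
        self.entries = sorted((obj[prop], i) for i, obj in enumerate(objects))
        self.keys = [v for v, _ in self.entries]

    def candidates(self, value: float, radius: float) -> list:
        # Binary-search the sorted keys for the window [value - radius, value + radius].
        lo = bisect.bisect_left(self.keys, value - radius)
        hi = bisect.bisect_right(self.keys, value + radius)
        return [self.objects[i] for _, i in self.entries[lo:hi]]
```

The short list returned by `candidates` is what you then brute-force with the full per-property comparison; the expensive work only runs on the handful of objects that survived the cheap filter.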

So what might such a "fuzzy index" look like?

  • if the property is just a real (floating-point) number, use some kind of interval tree for indexing. That allows a binary search to identify the "most similar" values, or all values that are "similar" within a certain range

  • for 2D or 3D points, the canonical choice would be a quadtree or octree. These are the 2D and 3D generalizations of the one-dimensional interval tree and are well known from computer graphics and visualization.

  • Creating an index for approximate string matching is a lot harder; start with the Wikipedia article on this topic, which contains references to scientific papers about indexing techniques for this class of problems. This article about fuzzy string matching in PostgreSQL may give you an impression of some of the different possible ways of defining similarity between strings, but as far as I can tell those PostgreSQL tools don't provide any fast indexing techniques.
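One classic approach to indexing for approximate string matching, and roughly what PostgreSQL's `pg_trgm` module is built around, is a trigram inverted index: strings sharing few three-character substrings cannot be close in edit distance, so counting shared trigrams gives a cheap candidate filter. A minimal sketch (the padding scheme and the `min_shared` cutoff are illustrative choices, not a fixed standard):

```python
from collections import defaultdict


def trigrams(s: str) -> set:
    # Pad with blanks so even short strings yield trigrams.
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of strings containing it
        self.strings = []

    def add(self, s: str) -> None:
        sid = len(self.strings)
        self.strings.append(s)
        for g in trigrams(s):
            self.postings[g].add(sid)

    def search(self, query: str, min_shared: int = 2) -> list:
        # Count shared trigrams per indexed string; strings sharing
        # fewer than min_shared trigrams are discarded outright.
        counts = defaultdict(int)
        for g in trigrams(query):
            for sid in self.postings.get(g, ()):
                counts[sid] += 1
        return [self.strings[sid] for sid, c in counts.items() if c >= min_shared]
```

Candidates returned by `search` would still be verified with a real edit-distance computation; the index only exists to avoid running Levenshtein against every stored string.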

For "fuzzy indexing" multiple properties at once, the only reference I could find is this paper, which is not freely available. Disclaimer: I did not read it, so I have no idea whether it will really help you.

Hope this helps.
