Algorithm for fast tag search

algorithms optimization

The problem is the following.

There's a set of simple entities E, each one having a set of tags T attached.
Each entity might have an arbitrary number of tags.
The total number of entities is about 100 million, and the total number of tags is about 5000.

So the initial data is something like this:

E1 - T1, T2, T3, ... Tn
E2 - T1, T5, T100, ... Tk
..
Ez - T10, T12, ... Tl

This initial data is quite rarely updated.

Somehow my app generates a logical expression on tags like this:

T1&T2&T3 | (T5&!T6)

What I need is to calculate the number of entities matching the given expression (note: not the entities themselves, just their count). The result does not have to be perfectly accurate, of course.

What I've got now is a simple in-memory table lookup, which takes 5-10 seconds per query on a single thread.

I'm curious: is there an efficient way to handle this? What approach would you recommend? Are there common algorithms or data structures for this?

Update

A bit of clarification as requested.

T objects are actually relatively short constant strings. But it doesn't actually matter – we can always assign some IDs and operate on integers.
We definitely can sort them.
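
Once tags and entities are plain integers, one way to picture the counting task is a bitmap per tag, where bit e is set when entity e carries that tag. The sketch below is only an illustration under that assumption: the class `TagBitmapIndex` and its methods are hypothetical names, not something from the question, and it hard-codes the shape of the example expression rather than parsing arbitrary ones. Note also that an uncompressed bitmap over 100 million entities takes roughly 12.5 MB per tag, so a real setup with 5000 tags would probably need some form of compression or sparse storage.

```java
import java.util.BitSet;

/** Hypothetical sketch: one bitmap per tag ID; bit e is set when entity e has the tag. */
final class TagBitmapIndex {
    private final BitSet[] byTag; // array index = tag ID

    TagBitmapIndex(int tagCount, int entityCount) {
        byTag = new BitSet[tagCount];
        for (int t = 0; t < tagCount; t++) {
            byTag[t] = new BitSet(entityCount);
        }
    }

    /** Records that the entity carries the tag. */
    void add(int entityId, int tagId) {
        byTag[tagId].set(entityId);
    }

    /** Counts entities matching (t1 & t2 & t3) | (t5 & !t6). */
    int countExample(int t1, int t2, int t3, int t5, int t6) {
        // Left branch: clone so the stored bitmaps stay untouched.
        BitSet left = (BitSet) byTag[t1].clone();
        left.and(byTag[t2]);
        left.and(byTag[t3]);

        // Right branch: t5 AND NOT t6.
        BitSet right = (BitSet) byTag[t5].clone();
        right.andNot(byTag[t6]);

        left.or(right);
        return left.cardinality(); // only the count is needed, not the entities
    }
}
```

Each operator in the expression becomes one pass over machine words, so the cost per query depends on the number of operators and the bitmap length rather than on how many tags each entity has.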

Best Answer

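I would do this in SQL, with a link table EntityCategory (eid referencing the entity, cid referencing the category, i.e. the tag) and self-joins. The query below counts entities that have all of the listed categories, so it covers a pure conjunction such as T1&T2&T3: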

    select count(ec1.eid)
    from EntityCategory ec1 
    left join EntityCategory ec2 on ec1.eid=ec2.eid 
    left join EntityCategory ec3 on ec1.eid=ec3.eid 
    ...
    where 
      ec1.cid={categoryId1} and 
      ec2.cid={categoryId2} and
      ec3.cid={categoryId3} ...
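
For comparison, the effect of those self-joins on a pure conjunction can be mimicked in memory. The sketch below assumes each category has a sorted, duplicate-free array of entity IDs (the question notes that the data can be sorted); the class and method names are hypothetical and this is not part of the answer's SQL approach.

```java
import java.util.Arrays;

/** Hypothetical sketch: count entity IDs that appear under every given category. */
final class PostingIntersection {

    /** Merge-style intersection of two sorted, duplicate-free ID arrays. */
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {
                out[n++] = a[i]; // ID present in both lists
                i++;
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Counts IDs present in all lists; stops early once the running result is empty. */
    static int countWithAll(int[]... eidsPerCategory) {
        if (eidsPerCategory.length == 0) {
            return 0;
        }
        int[] acc = eidsPerCategory[0];
        for (int k = 1; k < eidsPerCategory.length && acc.length > 0; k++) {
            acc = intersect(acc, eidsPerCategory[k]);
        }
        return acc.length;
    }
}
```

Starting with the shortest list usually keeps the intermediate results small, which is the same reason a database would prefer to begin with the most selective category.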

Related Solutions

Algorithms – Detecting Duplicate Articles or Posts

There are many algorithms that deal with document similarity in NLP. Here's a seminal paper describing various algorithms. Wikipedia also has a larger collection. I favor the Jaro-Winkler measure and have used it for grad school projects in agglomerative clustering methods.

Algorithms – Generating N Random Numbers Between A and B Summing to X

As said before, this question actually doesn't have an answer: The restrictions imposed on the numbers make the randomness questionable at best.

However, you could come up with a procedure that returns a list of numbers like that:

Suppose, for example, that we need 5 numbers in the range [-1,1] that sum to 1, and that the first two have been picked randomly as -0.8 and -0.7. The requirement is now to come up with 3 'random' numbers that sum to 2.5 and all lie in the range [-1,1]. This problem is very similar to the starting problem; only the dimensions have changed. However, if we now take an arbitrary random number in the range [-1,1], we might end up with no solution. We can restrict the range to make sure that a solution still exists: the sum of the last 2 numbers will be within the range [-2,2], so we need to pick a number in the range [0.5,1] to be able to reach a total of 2.5.

The section above describes one step in the process.

In general: Determine the range for the next number by applying the range of the rest of the numbers to the required sum. Pseudo-code:

function randomNumbers (number, range, sum) {
  // base case: the last number is forced to be whatever is left of the sum
  if (number == 1) return [sum]

  restRange = range * (number - 1)
  myRange = intersection ([sum - restRange.upper, sum - restRange.lower], range)

  myNumber = random(myRange)

  rest = randomNumbers (number-1, range, sum - myNumber)

  return [myNumber] + rest
}

So for the case described above

randomNumbers (3, [-1,1], 2.5)
  restRange = [-1,1] * (3-1) = [-2,2]
  myRange = intersection ([2.5-2,2.5-(-2)], [-1,1]) = intersection ([0.5,4.5],[-1,1]) = [0.5,1]
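
The single step worked through above can also be checked in isolation. This is a minimal sketch of just the range computation; the class name `FeasibleRange` and the method `next` are made up for illustration and are not part of the implementation that follows.

```java
/** Hypothetical helper: feasible interval for the next pick in the procedure above. */
final class FeasibleRange {

    /**
     * Returns {lower, upper} for the next number when `remaining` numbers,
     * each in [lo, hi], must still add up to `sum`.
     */
    static double[] next(int remaining, double lo, double hi, double sum) {
        int rest = remaining - 1;                     // numbers left after this pick
        double myLo = Math.max(lo, sum - hi * rest);  // the rest can contribute at most hi * rest
        double myHi = Math.min(hi, sum - lo * rest);  // the rest can contribute at least lo * rest
        return new double[] { myLo, myHi };
    }
}
```

Calling `FeasibleRange.next(3, -1, 1, 2.5)` gives the interval [0.5, 1], matching the hand calculation.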

A quick-and-dirty implementation in Java:

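// Assumes JUnit 4 on the classpath.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class TestRandomNumberList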
{

    @Test
    public void test()
    {
        double[] numbers = new double[5];
        randomNumbers(numbers, 0, -1, 1, 1);
        assertEquals(1.0, sum(numbers), 0.00001);
        for (double d : numbers)
        {
            assertTrue(d >= -1 );
            assertTrue(d <= 1);
        }
    }

    private void randomNumbers(double[] numbers, int index, double lowerBound, double upperBound, double sum)
    {
        int next = index + 1;
        if (next == numbers.length)
        {
            numbers[index] = sum;
        }
        else
        {
            int rest = numbers.length - next;  

            double restLowerBound = lowerBound * rest;
            double restUpperBound = upperBound * rest;

            double myLowerBound = Math.max(lowerBound, sum - restUpperBound);
            double myUpperBound = Math.min(upperBound, sum - restLowerBound);

            numbers[index] = random(myLowerBound, myUpperBound);
            // Recurse with the original bounds; the narrowed bounds apply only to this element.
            randomNumbers(numbers, next, lowerBound, upperBound, sum - numbers[index]);
        }
    }

    private double random(double myLowerBound, double myUpperBound)
    {
        double random = Math.random();
        return myLowerBound + random * (myUpperBound - myLowerBound);
    }

    private double sum(double[] numbers)
    {
        double result = 0;
        for (double num : numbers)
        {
            result += num;
        }
        return result;
    }

}
