Algorithm for fast tag search

algorithmsoptimization

The problem is the following.

  • There's a set of simple entities E, each one having a set of tags T attached.
    Each entity might have an arbitrary number of tags.
    Total number of entities is near 100 million, and the total number of tags is about 5000.

So the initial data is something like this:

E1 - T1, T2, T3, ... Tn
E2 - T1, T5, T100, ... Tk
..
Ez - T10, T12, ... Tl

This initial data is quite rarely updated.

  • Somehow my app generates a logical expression on tags like this:

    T1&T2&T3 | (T5&!T6)

  • What I need to is to calculate a number of entities matching given expression (note – not the entities, but just the number). This one might be not totally accurate, of course.

What I've got now is a simple in-memory table lookup, giving me a 5-10 seconds execution time on a single thread.

I'm curious, is there any efficient way to handle this stuff? What approach would you recommend? Is there some common algorithms or data structures for this?

Update

A bit of clarification as requested.

  1. T objects are actually relatively short constant strings. But it doesn't actually matter – we can always assign some IDs and operate on integers.
  2. We definitely can sort them.

Best Answer

i would do this in sql having a link table EntityCategory between eid referencing entity and cid referencing category using self-joins:

    select count(ec1.eid)
    from EntityCategory ec1 
    left join EntityCategory ec2 on ec1.eid=ec2.eid 
    left join EntityCategory ec3 on ec1.eid=ec3.eid 
    ...
    where 
      ec1.cid={categoryId1} and 
      ec2.cid={categoryId2} and
      ec3.cid={categoryId3} ...
Related Topic