C++ – Data retrieval and indexing

cdata structuresindexing

I have around 800,000 rows of data stored in the boost shared memory from the database. The data are in the form:

Id        Color        Length          Size

1        1                 2            4
2        3                 4            5
3        2                 2            0
4        1                 2            4......and so on

The Colors can be the value from 1-12 the length 1-4 and the size 1-5. The Id,Length,Color,Size are stored in seperate vector of 800,000 size in the shared memory. So there is Id vector for Id, Color vector for Color and so on.

I want to filter the data before I perform some computation. So I want data for which Color is 1 and length is 2 and Size 4 i.e row 1 and 4 in above case. Is there any efficient way to filter the data without using for loop and going through all the 800,000 images and checking the condition?

Right now I am just using mysql statement to get the Ids of the data which satisify the condition.

"select Id from features_table where Color=1 and Length=2 and Size =4"

But is there a faster way to do this? Or should I stick to this method? I am looking for a faster method so I am not sure whether fetching the Ids from the database will increase the execution time of the algorithm.

What are the other options that I can consider in this case? I read about Hash table, B-Tree, Binary Search tree and I am confused which is suitable for this case. Will kd-tree be helpful in this case? Because many images may have the same combination of color,length and size. I am not sure if kd-tree is the right thing to do. I read about FLANN in opencv used for kd-tree. Is there any example or resources which may be helpful in this case? Or are there any built in c++ libraries?

Best Answer

You've a lot of data duplication there, as the number of possible combinations of Color, Length and Size values is way too smaller than the number of rows.

A better alternative is storing a set of row indexes for each possible tuple:

struct Rows
{
    std::tuple<Color, Length, Size> properties;
    std::vector<int> rowids;
} ;

This way, queries are straightforward.

If the number of possible tuples is high enough, then it'd be worth to set indexes in the form of row set hashes, one hash per property: std::map<Color, Rows *> and so on.

Related Topic