Database – Efficient way to search data based on multiple attributes

Architecturedata modelingdatabasedatabase-designdesign

We are building a system where we accumulate data from many of our internal services, process them and generate set of data called Jobs saved to database. Our client application running on client systems periodically requests for these Jobs and eligible Jobs for the requested client will be sent in the response. For each client request we need to search for qualified Jobs from Jobs table based on client request parameters.

Example:

Jobs:

#1
{
    "id": "1",
   "searchFields":{
      "key1":"value1",
      "key2":"value2",
      "key3":"value3"
   },
   "job": <SOME COMMAND>
}

#2
{
    "id": "2",
   "searchFields":{
      "key3":"value3",
      "key4":"value4",
   },
   "job": <SOME COMMAND>
}

#3
{
    "id": "3",
   "searchFields":{
      "key5":"value5",
      "key6":"value6",
   },
   "job": <SOME COMMAND>
}

How search should work?

We have set of attributes in client request, these attribute's values should match with attribute values in Job's search field "searchField". If client request has attributes "key1", "key2" and "key3" and job's searchField has "key1" and "key2" then this jobs is qualified only if the value of "key1" and "key2" of both client request and Job's searchField matches.

Client Requests:

# 1
{
      "key1":"value1",
      "key2":"value2",
      "key3":"value3"
}
Job #1 is qualified. Job #2 is not qualified because request input does not have "key4".


# 2
{
      "key1":"value1",
      "key2":"some_different_value",
      "key3":"value3"
}
No jobs are qualified because value of key2 doesn't match with any job

# 3
{
      "key1":"value1",
      "key2":"value2",
      "key3":"value3",
      "key4":"value4"
}

Job #1 and #2 are qualified.


# 3
{
      "key2":"value2",
      "key3":"value3",
      "key4":"value4",
      "key5":"value5",
      "key6":"value6",
}

Job #2 and #3 are qualified. Job #1 is not qualified because "key1" does not exists in input data

We have already built prototype for this system using MySQL database, but we feel MySQL is not well suited for these kind of systems. Jobs table is very huge and keeps growing (more than 1,00,000 records added each day) and searching Jobs based on multiple attributes just by using standard SQL queries (without indexed fields) is not efficient. Also, attributes in client request and Job's searchField are dynamic. We don't have fixed set of attributes to work on. New attributes can added or removed anytime, so if we are using SQL queries than handling dynamic attributes would be cumbersome.

What we have tried?

We created combinationKey and hashKey of all key-value attributes in searchField for each Job and saved it to database along with job.

How combinationKey and hashkey are generated,

Job:

{
    "id": "1",
   "searchFields":{
      "key1":"value1",
      "key2":"value2",
      "key3":"value3"
   },   
   "job": <SOME COMMAND>
}

combinationKey="key1::key2::key3"
hashKey=sha256Of("key1=value1::key2=value2::key3=value3")

And save Job as,

{
    "id": "1",
    "combinationKey": "key1::key2::key3",
    "hashKey":<Hash_Key>,
   "searchFields":{
      "key1":"value1",
      "key2":"value2",
      "key3":"value3"
   },   
   "job": <SOME COMMAND>
}

When we receive client request, we fetch unique combination keys from Jobs table (full table scan or cache) and generate hash for incoming request attributes,

Unique combination keys from Jobs table: "key1::key2::key3", "key4::key5" … etc

Client Request:

{
      "key1":"value1",
      "key2":"value2",
      "key3":"value3",
      "key4":"value4"
}

Iterate through all combination keys and generate list of hashKeys for incoming request attributes and once hashKeys are generated, search for Jobs in Jobs table matching these hashKeys.

This approach appears to be working fine in our prototype but I feel it is not efficient enough, because as Jobs data grow, there is a possibility that unique combinationKey count could grow exponentially and for each client request, calculating hashkey for all combinationKeys would be computation intensive.

What we need?

  1. Most efficient and easy way to search Jobs table based
    on incoming client request attributes as per our requirements.
  2. Consistent and accurate search. We do not want to use
    search engines like Elasticsearch as we need accurate data and not for analytics.
  3. Best database suited for these kind of systems.
  4. Any companies (such as facebook, google) already using these kind of systems so we can
    analyze their approach.

Any help would be appreciated.

Best Answer

Your requirements seem to need full text indexing/search abilities which most RDBMs (including MySQL) can do. But if full text searching is the majority of what you will be doing, you might be better off with a noSQL type DB like Solr. Of course you could use both, if you need things that SQL RDBMs does well and noSQL doesn't do well. But from your description of your needs, it sounds like noSQL is the direction to go.

Related Topic