Database Query – How to Query Key/Value Store Efficiently

databasedatabase-designdatabase-developmentqueryredis

I don't know whether this question is reasonably answerable (and therefore will be closed), but I will take my chances: What are the possible problems (and solutions), one might encounter, when developing query mechanism for key/value store/database (like redis for example)? All the values will consist only of JSON documents and all data will be located only in RAM. Query mechanism should be able to search/query for data based on document field values, so I assume one of the problems might be performance based on number of docs in database (this could however be lowered by partitioning documents based on their type like posts, comments, …).

Best Answer

I am not familiar with Redis, however, some other key/value store databases have the following problems:

Updates are not instantly visible immediately.
SQL joins are to be performed by application.
Some SQL features such as Distinct, Group By are to be performed by application.
No stored procedures, triggers, etc.
No FKs

(2) and (3) above are particulary important because to achieve this you have to move large amounts of data to the client as well as write the logic.

Related Solutions

Database – Why is NoSQL better for this scenario

What advantages would I get from using NoSQL?

NoSQL will scale better as the number of users grows.

Traditional RDBMS don't really scale well. All that you can do is throw bigger machines at the problem. They aren't really suited for distributed systems (cloud e.g.).

NoSQL is (under given circumstances) better at handling hierarchical structures like documents/JSON.

The key point to understand is that these storage mechanisms are key-value based and thus can retrieved data that is stored together very fast, as opposed to data that is "merely related" (what RDBMS were built for).

In your case that would mean, that you can easily retrieve all records for a certain user very fast for example. In traditional relational databases you would either have to denormalize your schema for performance or keep the schema clean but potentially suffer performance penalties caused by joins or heavy aggregations.

Look at it this way: Why is a hash map (key value store) fast? You can retrieve items from a hashmap in almost O(1) as the hash directly translates to a memory address (simplified). Looking up a binary index in contrast to that would yield O(log(n));

For your case, MongoDB or CouchDB might be good solutions, as it's already based on JSON.

In my opinion, using a NoSQL solution here is a good choice. You want to retrieve all the activities of a user as a feed. If they're properly written to your data storage, then NoSQL should, in theory, excell at this, without the need for joining anything or worrying about proper indexes. @Earlz also mentioned that you have no ACID guarantee for NoSQL databases. This makes NoSQL fast and you probably don't need ACID properties for your application. Give it a try!

Moreover, there's a good article from Martin Fowler on the subject. He's made a nice diagram that I really like:

enter image description here

Go check out his pages to read some deep thoughts about NoSQL.

Caching – Redis Write-Through Cache vs Write-Behind Cache

Your question is specific, so I'll answer it head-on first.

I anticipate that "write-behind" cache will be better. "write-through" might help with occasional spikes in ingress data, but won't alleviate the problem of disk throughput (and transaction throughput).

Of course, this would need to be verified in real-life with benchmarking.

My recommended implementation

I strongly recommend that you don't overengineer your architecture. Less is more.

YAGNI: Start simple and tunable, then update your architecture as needed.

1. Bypass the Java and Redis, and write directly to the database

see https://colossal.gitbook.io/microprocess/definition/data-web-gateway. This approach lets you write directly to the database over HTTP.
Choose MariaDB(MySQL) as the database and choose the InnoDB engine; I am quite sure that it generally has higher throughput than PostgreSQL, and supports a higher number of concurrent connections (~200k vs 500) - but you should measure that yourself using the target hardware/VM that you plan to use. (see [Compatible Database Performance Comparisons] in https://colossal.gitbook.io/microprocess/database-system/introduction)
Have a table with the available Database-Web-Gateway endpoints (that you can add to in the future) and have your client connect to a random one; select * from View_DataWebGateways order by RAND() limit 1;
Use a writeable view so that you can point straight to the posts table today, and be able to change that in the future. Query something like: insert into WriteView_Posts (...) values (...). The same goes for PostComments and PostReactions
(Install MySQL and the Database Web Gateway on the same VM)

2. Optimise DB and Hardware

Right off the bat, you need a DBA and SysOps person who you can call on to scale up your DB and VM/Hardware and stay ahead of your growth curve. You'll probably find that starts happening in say 2 years anyway. If it happens sooner, your business is doing great and you can now hire more programmers yay!

3. Logical Sharding

With posts, you have a natural shard topic - the post. You won't have PostComments that belong to two posts, they will have an affinity for the Post. You also have Geo affiliated data, so I am guessing that people close to those locations will want that data more - with Sharding you can master that data in a closer data center.

Create a "Management" database to hold a Shards table, as well as the DataWebGateways table. This will be a replicated database with one master, and each new shard will have a read-replica of this.

For each new shard:

Create a new VM, create a new Database with a numbered scheme "Posts_1", "Posts_2". (there is NO need to replicate these databases)
Make sure you have a read-replica of the Management database to this MariaDB instance.
Insert the relevant Shard and DataWebGateway records into the master instance (manually I guess).

Now, you can make the View_DataWebGateways view a bit smarter. Let the client get all of the records, so that it can randomly choose one to use, while pinging the others to see which is closest (by latency), then switch to that.

4. Staging Table

It's probably best that you just create another shard, but if you do testing, and find that batching greatly improves insert performance, then you can do the following:

Change WriteView_Posts to point to PostStaging instead of Posts table.
Benchmark different MariaDB engines, and select the right engine and configuration for the PostStaging table that optimise for Insert performance. This table can be configured to a dedicated Disk Volume that is also benchmarked to perform best for the simulated load (sequential reads).
Ensure you don't have any constraints or triggers on this PostStaging table. (While Posts can)
Create a Timed-Interval Microprocess which batches inserts into Posts from PostStaging every X seconds.

Now, if the power goes out, you don't have to lose your "cached" posts. This also helps because you can also redirect posts to the appropriate shard by geography at this point.

5. "High load" probably means Reads not writes

Add more VM/hardware: create Read-Replicas of Shards within the same datacentre, and add DataWebGateways records with ReadOnly flag for the client to use. The client will do all reading from read-replicas, and writing to the single write master DWebG.

6. Curate feeds with FeedViews

Avoid caching at all costs - https://colossal.gitbook.io/microprocess/building/caching

If you would like to curate specific geographic feeds (eg. Germany, France, ...). While a geography shard server is great, Germans might want a mix of posts from around the world, while the French only want French posts - that's up to you to figure out.

To make this work:

Have a crude view View_CountryFeed and use it like select * from View_CountryFeed order by post_id desc limit 10;. This will only work on a single shard to begin with.

When that is reaching its limits, improve your algorithm and also introduce manual materialisation with the following:

Log interest in posts per country
Run your algorithms as microprocesses
The algorithm parameters will be shared on the Management replicated database
The algorithm will run in isolation on each Shard grouping to Country IDs
The Management database will be configured on which shard to store the feed
A microprocess will pull country feeds from other shards to aggregate. Each shard will pull the right CountryIDs as configured. "Pulling" is done on the resource-sensitive side so this can be tuned per-shard to minimise resource usage.
Update View_CountryFeed: having PostID, PostBodyJSON, ShardID. The ShardID is important because that's where post interactions will be sourced from and published, while the PostBodyJSON will enable efficient rendering of the post by the client.

7. Realtime Notifications

Database Web Gateway will be able to help here to in the future. see https://colossal.gitbook.io/microprocess/database-system/feature-gaps. When the CountryFeed table is updated, a trigger will fire and notify the DWebG which will rerun subscribed Views and send new records to the subscribed web clients.

Best Answer

Related Solutions

Database – Why is NoSQL better for this scenario

Caching – Redis Write-Through Cache vs Write-Behind Cache

Related Topic