Java Performance – Super Fast File Storage Engine

concurrency · java · performance · postgres

I basically have one gigantic table (about 1.000.000.000.000 records) in a database with these fields:

id, block_id, record

id is unique; block_id is not unique. There are up to about 10k records sharing the same block_id, each with a different record.

To simplify my job that deals with the DB I have an API similar to this:

Engine e = new Engine(...);
// this method must be thread-safe, but with fine-grained locking (per block_id) to improve concurrency
e.add(block_id, "asdf"); // "asdf" is up to 1 kilobyte max

// this must concatenate all the records already added for block_id; the result won't be bigger than 10 MB (worst case), average < 5 MB
String s = e.getConcatenatedRecords(block_id);

If I map each block to a file (I haven't done it yet), then each record will be a line in the file and I will still be able to use that API.

But I want to know whether I will get any performance gain by using flat files compared to a well-tuned PostgreSQL database (at least for this specific scenario)?

My biggest requirement, though, is that the getConcatenatedRecords method returns stupidly fast (the add operation is less critical). I am also considering caching and memory mapping; I just don't want to complicate things before asking whether there is an already-made solution for this kind of scenario.
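To make the concurrency requirement concrete, here is a minimal in-memory sketch of the API above. It is an assumption about how the Engine might look, not an actual implementation: each block_id gets its own list, and threads synchronize on that list only, so adds to different blocks never contend.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the Engine API with per-block_id locking.
public class Engine {
    private final ConcurrentHashMap<Long, List<String>> blocks = new ConcurrentHashMap<>();

    // Thread-safe; only threads touching the same block_id contend,
    // because the lock is the per-block list itself.
    public void add(long blockId, String record) {
        List<String> list = blocks.computeIfAbsent(blockId, k -> new ArrayList<>());
        synchronized (list) {
            list.add(record);
        }
    }

    // Concatenates every record added so far for this block_id.
    public String getConcatenatedRecords(long blockId) {
        List<String> list = blocks.get(blockId);
        if (list == null) return "";
        StringBuilder sb = new StringBuilder();
        synchronized (list) {
            for (String r : list) sb.append(r);
        }
        return sb.toString();
    }
}
```

A real engine would of course spill blocks to disk (or Postgres) instead of holding everything in a map, but the locking pattern stays the same.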

Best Answer

It seems like your "storage system", whatever that is, has a very simple abstraction interface. It essentially boils down to "here's an id, gimme data".

So you can easily define this interface and build your entire app on top of it. Behind the scenes you can continue using PostgreSQL as you do today. And if you want to experiment with flat-file storage, it shouldn't take you more than 1 or 2 days to implement something very simple that reads/writes files straight to disk (my recommendation is to have 1-3 tiers of directories based on the first portion of the ID, so you don't have too many files in one flat directory).
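The directory-tiering idea can be sketched in a few lines. This is an illustrative helper, not part of the answer's code; the class and method names, the two-tier depth, and the `.dat` extension are all assumptions:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: derive a file path from a block id so that the
// first portion of the (zero-padded hex) id selects two directory tiers,
// keeping any single directory from accumulating millions of files.
public class BlockPaths {
    public static Path pathFor(String root, long blockId) {
        String hex = String.format("%08x", blockId);
        // two tiers: root/12/34/1234abcd.dat
        return Paths.get(root, hex.substring(0, 2), hex.substring(2, 4), hex + ".dat");
    }
}
```

If ids are small and cluster in the low range, hashing the id before slicing it would spread files more evenly across the tiers.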

If you do that, you can compare the performance head to head and see if it's good enough for you.

However, as Euphoric pointed out, most NoSQL stores were introduced and became popular for the very purpose you are trying to accomplish. I'm not going to recommend a specific store, as that's something for you to decide, but some advantages they provide are:

  1. Managing storage of huge numbers of small entities by buffering and doing writes in larger chunks. In my experience most file systems can handle very large numbers of files, but not in the most efficient way. For example, if you were to delete that many files from disk, then unless you just reformat the whole disk, a plain `rm -rf *` could take multiple hours.
  2. If and when you outgrow the limitations of a single physical box, a lot of NoSQL solutions allow you to scale horizontally, which gives you a) more storage, b) redundancy of data, so if one host goes down your storage service is still online, and c) faster query times, since your clients can load-balance where they get the info.

Another option to consider is that your storage and your indexing do not have to be implemented in the same system. You could use a separate indexing product like Solr or Elasticsearch and store the actual data in a NoSQL DB (or a straight-up file system).
