Database Design – Storing Files Up to 50MB in a Database for Multiple Servers

Tags: design, file-storage, relational-database, server

I'm in the process of designing a server responsible for serving files that are between 10MB and 50MB in size.
Initially we will run two instances of the server (let's call them fs1 and fs2), with future plans to switch to a microservice architecture, where the number of server instances will grow or shrink depending on the load.

These two instances need to interact with a third server running a scheduler and a file management application, as well as a database (on another server) where some metadata will be saved for clients to use.

My initial thought was to use RabbitMQ to allow fs1 and fs2 to communicate with each other and with the management app. The process would work as follows:

  1. The management app uploads to fs1 server (could be either fs1 or fs2)
  2. fs1 notifies fs2 and the management app when upload is complete
  3. fs2 contacts fs1 and stores a copy of the file
  4. fs2 notifies the management app when upload is complete
  5. The management app saves metadata to the external database
  6. Both fs1 and fs2 can now serve the files when requested
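For illustration only, the six steps above can be sketched as a tiny in-memory simulation. Nothing here talks to RabbitMQ: the `servers` dictionary, the `replicate_upload` function and the event strings are made-up stand-ins for the real messages, and the only point is to show that every upload triggers one extra copy per additional instance.

```python
def replicate_upload(servers: dict, origin: str, file_id: str, data: bytes) -> list:
    """Simulate one upload being copied to every other file server.

    `servers` maps a server name to a dict of {file_id: bytes}; it is a
    hypothetical stand-in for the real fs instances.
    """
    events = []
    servers[origin][file_id] = data                    # step 1: upload to one instance
    events.append(f"{origin}: upload complete")        # step 2: notify the others
    for name, files in servers.items():
        if name == origin:
            continue
        files[file_id] = servers[origin][file_id]      # step 3: peer pulls a copy
        events.append(f"{name}: copy complete")        # step 4: peer reports back
    events.append("management app: metadata saved")    # step 5: metadata to the DB
    return events                                      # step 6: all instances can serve


servers = {"fs1": {}, "fs2": {}, "fs3": {}}
for event in replicate_upload(servers, "fs1", "job-1.bin", b"example"):
    print(event)
```

With three instances the single upload already produces two full copies of the file, and that number keeps growing with every instance added.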

This seems OK, if there are only two instances, but once you start adding more it doesn't work.
Our ops department are very much against the idea of using the database to store files. They are worried that it will slow down the system too much. I agree it might, which is why I want a separate database for the specific purpose of storing the files and metadata.
I want to build something like the following:
[Image: system diagram]

My thinking is that the upload service can manage uploading of files and saving of metadata to the database.
When the scheduler schedules a new job, the upload service (badly named, I know, but I'm not making that image again 🙂 ) can notify the file server instances that they need to cache the required file(s) from the database, which they can access directly.
The file servers won't need to cache more than 5 or 6 files each at a time.
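A cache of this size could be as simple as a small least-recently-used map. The following is a minimal sketch, not a finished design: `FileCache`, its `fetch` callback and the default capacity of 6 are all assumptions made for illustration, and `fetch` stands for whatever call actually reads a file from the central store.

```python
from collections import OrderedDict
from typing import Callable


class FileCache:
    """Keeps at most `capacity` files in memory, evicting the least recently used."""

    def __init__(self, fetch: Callable[[str], bytes], capacity: int = 6) -> None:
        self._fetch = fetch            # hypothetical loader for the central store
        self._capacity = capacity
        self._files: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, file_id: str) -> bytes:
        if file_id in self._files:
            self._files.move_to_end(file_id)    # mark as most recently used
            return self._files[file_id]
        data = self._fetch(file_id)             # cache miss: load from the store
        self._files[file_id] = data
        if len(self._files) > self._capacity:
            self._files.popitem(last=False)     # evict the least recently used file
        return data

    def cached_ids(self) -> list:
        """File ids currently cached, least recently used first."""
        return list(self._files)
```

At 50MB per file, six cached files is roughly 300MB per instance, so for a real service caching on local disk instead of in memory may be the more comfortable choice.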
Also, in the diagram I missed that the file management service will receive download progress messages from both file servers.

So to my questions:

  1. Is this a reasonable way to store files of this size for serving?
  2. Is this the right way to be thinking when considering the move to microservices in the future?
  3. Are there advantages to storing the files on the file system of each fs instance instead of just caching?
  4. How can I convince our ops team that storing 50MB files in a database is the way to go? What are the pros and cons?
  5. Any other thoughts or comments appreciated.

Best Answer

NO, don't store files in a relational database

Trust me, I've learned this the hard way. One problem with applications that deal with files is that, as they evolve, users always want to store more than the application was intended to handle.

I once created an application with a document storage component meant to store Word and Excel documents. The storage component was useful enough that eventually people started storing videos in it.

I mention this because the performance impact will be higher than you expect, which leads me to my next point.

Even if a database can handle files fine (for example, with a FILESTREAM type), scaling a DB is hard; it is always the hardest part to scale. Let the DB concentrate on saving and retrieving data, so that you can put off scaling it as long as possible. If your DB is busy serving a large file, those resources are not available for transaction and lookup requests, which are its bread and butter.

Server to Server synchronization does not scale well

Your system seems over-complicated to me; I would go with a simpler design. The problem with servers fs1 and fs2 talking to each other is that, as you scale, the number of paths grows quadratically.

With two servers, each server only has to make one synch request, for a total of 2 paths. With 3 servers there are a total of 6. With 5 servers there are 20. synchRequests = n * (n - 1), where n = number of servers.
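To see how quickly this grows, the formula can be evaluated for a few cluster sizes. This is just the arithmetic above in runnable form; the function name `sync_requests` is my own.

```python
def sync_requests(n: int) -> int:
    """Directed sync paths when each of n servers must contact every other one."""
    return n * (n - 1)


for n in (2, 3, 5, 10, 50):
    print(f"{n:>2} servers -> {sync_requests(n):>4} sync requests")
```

Ten servers already need 90 sync requests per round, and fifty need 2450.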

I would simply have a dedicated DB server and a dedicated file server that the FSn servers talk to. If you need more complex synchronization behavior, add a dedicated Redis server to the mix to serve as the single source of truth for non-persistent details.

The point is, don't have fs1 talking to fs2, or vice versa; this will not scale.

Graph

                             [ fs1 ] [ fs2 ] [ fs3 ] [ ect ]
                                |       |       |       |
                                +-------+---+---+-------+
                                            |
                         +------------------+-------------------+
                         |                  |                   |
                      [ RDB ]           [ Redis ]           [ Files ] 
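As a rough sketch of the layout in the graph, each fs instance below holds references to the shared backends and to nothing else. Everything named here is an assumption for illustration: an in-memory SQLite database stands in for the RDB, a temporary directory stands in for the file server, Redis is left out entirely, and `FileServer` and `build_cluster` are made-up names.

```python
import sqlite3
import tempfile
from pathlib import Path


class FileServer:
    """One fsN instance. It knows the shared backends but no other fs instance."""

    def __init__(self, name: str, db: sqlite3.Connection, store: Path) -> None:
        self.name = name
        self.db = db          # stand-in for the dedicated RDB (metadata only)
        self.store = store    # stand-in for the dedicated file server

    def upload(self, file_id: str, data: bytes) -> None:
        (self.store / file_id).write_bytes(data)      # file bytes go to the file store
        self.db.execute(
            "INSERT INTO files (id, size) VALUES (?, ?)", (file_id, len(data))
        )
        self.db.commit()                              # only metadata goes to the DB

    def serve(self, file_id: str) -> bytes:
        row = self.db.execute(
            "SELECT size FROM files WHERE id = ?", (file_id,)
        ).fetchone()
        if row is None:
            raise KeyError(file_id)                   # unknown to the metadata DB
        return (self.store / file_id).read_bytes()


def build_cluster(n: int, store: Path) -> list:
    """Create n instances that all share one metadata DB and one file store."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE files (id TEXT PRIMARY KEY, size INTEGER)")
    return [FileServer(f"fs{i + 1}", db, store) for i in range(n)]


with tempfile.TemporaryDirectory() as tmp:
    fs1, fs2, fs3 = build_cluster(3, Path(tmp))
    fs1.upload("job-42.bin", b"payload")
    print(fs3.serve("job-42.bin"))    # fs3 can serve it without ever contacting fs1
```

Adding a fourth or fortieth instance adds one set of connections to the backends, not a new link to every existing instance.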

The best of both worlds?

You can head off most of the disadvantages of storing your files in an RDB, and still get most of the advantages, by setting up a completely separate DB instance and storing only your files there. This is a viable option if you don't want to set up and maintain a file server.
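A minimal sketch of that split, assuming SQLite purely as a stand-in: two independent connections play the role of two separate database instances, and the table and function names (`documents`, `blobs`, `save_document`, `load_document`) are invented for the example.

```python
import sqlite3

# Two independent connections stand in for two separate database instances.
main_db = sqlite3.connect(":memory:")     # transactions and lookups only
files_db = sqlite3.connect(":memory:")    # file contents only

main_db.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT, size INTEGER)"
)
files_db.execute(
    "CREATE TABLE blobs (document_id INTEGER PRIMARY KEY, content BLOB)"
)


def save_document(name: str, content: bytes) -> int:
    """Store metadata in the main DB and the bytes in the files-only DB."""
    cur = main_db.execute(
        "INSERT INTO documents (name, size) VALUES (?, ?)", (name, len(content))
    )
    doc_id = cur.lastrowid
    files_db.execute(
        "INSERT INTO blobs (document_id, content) VALUES (?, ?)", (doc_id, content)
    )
    main_db.commit()
    files_db.commit()
    return doc_id


def load_document(doc_id: int) -> bytes:
    """Read the bytes back; only the files-only DB does the heavy lifting."""
    row = files_db.execute(
        "SELECT content FROM blobs WHERE document_id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise KeyError(doc_id)
    return row[0]
```

Note that the two commits are not atomic across instances, so a real implementation would need to decide what happens when the second write fails.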

A quick word about microservices

I am not sure why you would want to go the microservices route. The original intent of microservices is to get around political problems, not technical problems. For example, the server admin refuses to open any ports other than 80.
