Distributed Systems – Store File in Filesystem and Metadata in Database Atomically

database, distributed-system, file-systems

I have to store many PDF/JPG/PNG files, each at most 10 MB, in a filesystem, and I need to save their metadata in a database.

The SFTP server and the DB may be on different nodes. On the WS, I have a local DB that I use to check the login and to look up the addresses of the database and the filesystem.


I was wondering: what happens if the DB fails just after I've uploaded the file to the SFTP server? Or, even worse, if the WS fails before inserting the data into the DB?
Since I have the constraint of returning the id of the new insert, I can't defer the insert. How is this kind of system usually handled?

Best Answer

Here's a partial, but more practical and likely applicable, answer. If follow-on processes only consider files via their metadata entries, then the file upload and the update to the metadata need not be atomic. This means you can't have a process that processes every file in the SFTP directory. It would instead need to fetch the list of files from the metadata table and process each file in the resulting list. Similarly, you can't have a process that checks for a specific file in the directory; it would instead check for an entry in the metadata table.
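As a rough illustration of "only consider files via their metadata entries", here is a minimal sketch. It assumes a hypothetical `file_metadata` table with a `stored_name` column, uses SQLite for brevity, and treats a local directory as a stand-in for the SFTP directory; none of these names come from the question.

```python
import sqlite3
from pathlib import Path
from typing import Callable


def process_known_files(
    conn: sqlite3.Connection,
    storage_dir: Path,
    handle: Callable[[int, Path], None],
) -> int:
    """Process only the files that have a metadata row; return how many were handled."""
    # The metadata table, not the directory listing, is the source of truth.
    # Files in storage_dir without a row (orphans) are never looked at.
    rows = conn.execute(
        "SELECT id, stored_name FROM file_metadata ORDER BY id"
    ).fetchall()
    for file_id, stored_name in rows:
        handle(file_id, storage_dir / stored_name)
    return len(rows)
```

Note that the function never lists the directory, so a half-finished or orphaned upload is simply invisible to it.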

In this context, you can simply upload the file and, once the upload has been verified, insert a metadata record. If the process fails at any point before the metadata insert commits, you just end up with a harmless orphan file. This assumes that file names are distinct; however, if you always go through the metadata table, the actual file names in the SFTP directory no longer matter, so you can just tack a UUID to the end of them to guarantee they are distinct.
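A minimal sketch of that ordering, under stated assumptions: a local file copy stands in for the SFTP upload, SQLite stands in for the metadata database, verification is done with a SHA-256 comparison, and the `file_metadata` table and both function names are hypothetical.

```python
import hashlib
import shutil
import sqlite3
import uuid
from pathlib import Path


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) metadata table used by this sketch."""
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS file_metadata ("
            " id INTEGER PRIMARY KEY,"
            " original_name TEXT NOT NULL,"
            " stored_name TEXT NOT NULL UNIQUE,"
            " sha256 TEXT NOT NULL)"
        )


def store_file(conn: sqlite3.Connection, storage_dir: Path, source: Path) -> int:
    """Upload first, verify, then insert the metadata row; return the new row id."""
    # A UUID tacked onto the end keeps stored names distinct even when
    # two uploads share the same original name.
    stored_name = f"{source.name}.{uuid.uuid4().hex}"
    target = storage_dir / stored_name

    # Step 1: "upload" (a local copy stands in for the SFTP transfer).
    shutil.copyfile(source, target)

    # Step 2: verify the upload by comparing checksums.
    expected = hashlib.sha256(source.read_bytes()).hexdigest()
    actual = hashlib.sha256(target.read_bytes()).hexdigest()
    if expected != actual:
        raise OSError(f"upload verification failed for {source.name}")

    # Step 3: only now make the file visible by committing its metadata.
    # A crash anywhere before this commit leaves an orphan file and no row.
    with conn:
        cur = conn.execute(
            "INSERT INTO file_metadata (original_name, stored_name, sha256)"
            " VALUES (?, ?, ?)",
            (source.name, stored_name, expected),
        )
    return cur.lastrowid
```

The id is returned only after the insert has committed, so the caller never receives an id for a file that is not fully stored.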

You may want to clean out orphans (particularly as "overwriting" a file in this context simply means orphaning the old file). This can be done in a background process that deletes files older than some time horizon, which can likely be at least 24 hours and quite possibly on the order of months.
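One way such a background cleanup could look, as a sketch only: it treats a file as an orphan when no row in a hypothetical `file_metadata` table references its name, and it deletes the orphan only once it is older than the chosen time horizon. As before, a local directory stands in for the SFTP directory and SQLite for the database.

```python
import sqlite3
import time
from pathlib import Path
from typing import List, Optional


def delete_orphans(
    conn: sqlite3.Connection,
    storage_dir: Path,
    min_age_seconds: float,
    now: Optional[float] = None,
) -> List[str]:
    """Delete files with no metadata row that are older than min_age_seconds."""
    now = time.time() if now is None else now
    # Names that some metadata row still points at; these are never deleted.
    known = {
        row[0] for row in conn.execute("SELECT stored_name FROM file_metadata")
    }
    removed = []
    for path in storage_dir.iterdir():
        if not path.is_file() or path.name in known:
            continue
        # The age check protects uploads that have finished copying but
        # whose metadata insert has not committed yet.
        if now - path.stat().st_mtime >= min_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

The horizon only has to be comfortably longer than the slowest plausible gap between finishing an upload and committing its metadata row, which is why a day or more is a safe default.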

If the assumptions don't apply, then something like Christophe's approach becomes necessary.
