Linux – valid stability argument against NFS


We are adding a feature to our web app where uploaded files (to app servers) are processed by background workers (other machines).

The nature of the application means these files stick around for a certain amount of time. Code executing on the worker knows when files become irrelevant and should delete the file at that time.

My instinct was to ask our sysadmins to set up a shared folder using NFS. Any web server can save the file onto the NFS share, and any worker can pick it up to work on it. Signalling and choreographing of work happens via data in a shared Redis instance.
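To make the intended flow concrete, here is a minimal sketch of what I had in mind, assuming a Python stack with the redis-py client; the queue name, host names and paths are placeholders, and process() stands in for the real work:

    import json
    import os
    import redis  # redis-py client (assumed)

    r = redis.Redis(host="redis.internal", port=6379)

    def process(path):
        """Placeholder for the real background work."""

    # Web server side: save the upload onto the NFS share, then announce it.
    def enqueue_upload(upload_id, data):
        path = os.path.join("/mnt/nfs/uploads", upload_id)
        with open(path, "wb") as f:
            f.write(data)
        r.rpush("uploads:pending", json.dumps({"id": upload_id, "path": path}))

    # Worker side: block until a job arrives, process it, and delete the file
    # once the worker decides it is no longer relevant.
    def worker_loop():
        while True:
            _, raw = r.blpop("uploads:pending")
            job = json.loads(raw)
            process(job["path"])
            os.remove(job["path"])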

About the NFS, I was told:

Typically, for this kind of use case, we route all upload requests to a single web server. The server that handles uploads will write the files to a directory, say /data/shared/uploads, which is then synchronized in a read-only fashion to all other servers.

It sounded like they didn't like NFS. I asked what the problems were. I was told:

In regards to NFS or any other shared file system, the problem is always the same – it introduces a single point of failure. Not only that, it also tightly couples all servers together. Problems with one server can affect the others, which defeats the purpose of load balancing and de-coupling.

We are currently at the scale where we have multiple web servers and workers, but still a single DB instance and a single Redis instance. So we already have single points of failure that we are tightly coupled to.

Is NFS so problematic that the above arguments are valid?

Best Answer

NFS background

NFS is fine while it works, but it has many issues, as the protocol is 31 years old. Of course there are newer versions, which fix some things but bring other issues with them.

The main issue is how NFS fails. Because both the NFS client and the server are kernel-based, most NFS outages end with the whole server being rebooted. In soft mode, any filesystem operation (read/write/mkdir/...) can fail in the middle of something, and not all applications are able to handle that. For that reason NFS is often run in hard mode, which means those operations can hang forever, accumulating more and more hanging processes. Typical triggers are short temporary network outages, configuration errors and so on. And instead of failing outright, NFS can also simply slow everything down.

If you choose NFS for any reason, you should use it over TCP: with UDP on links of 1 Gbit/s and faster, data corruption is very likely to occur (the man page warns about this as well).
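For illustration only, both the hard/soft behaviour and the transport are client-side mount options; a hypothetical /etc/fstab entry (server name and paths made up) could look like this:

    # Hypothetical NFS client entries; "filer" and the paths are placeholders.
    # hard (the default): operations retry forever, so processes hang until the server returns.
    filer:/export/uploads  /mnt/uploads  nfs  hard,proto=tcp,timeo=600,retrans=2  0  0
    # soft: operations give up after the retries and return an I/O error to the application.
    #filer:/export/uploads  /mnt/uploads  nfs  soft,proto=tcp,timeo=600,retrans=2  0  0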

Other options

What I would suggest: if you really don't need NFS, don't use it. I'm not aware of any of the top websites (Facebook, Google, ...) that use NFS for this, as for web workloads there are usually better ways of achieving the same thing.

The synchronization approach mentioned in the question itself is fine; usually you can live with a few seconds of delay. You can, for example, serve the file to the uploader (who expects it to be live) from the web server where it was uploaded, so they see it instantly, while other users will see it a minute later once the sync job has run.
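As a rough sketch of such a sync job (host names and paths are invented, and it assumes passwordless SSH between the servers), a cron-style loop on the upload server could push the directory to the read-only mirrors with rsync:

    import subprocess
    import time

    MIRRORS = ["web2.internal", "web3.internal"]  # hypothetical read-only mirrors
    SRC = "/data/shared/uploads/"

    def sync_once():
        # Push the upload directory to every mirror; --delete removes files
        # that the upload server has already cleaned up.
        for host in MIRRORS:
            subprocess.run(["rsync", "-a", "--delete", SRC, f"{host}:{SRC}"], check=True)

    if __name__ == "__main__":
        while True:
            sync_once()
            time.sleep(60)  # the other servers see new files after the next run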

Another solution is to store the files in a database, which can itself be replicated if needed, or to use distributed storage such as Amazon S3.
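If you go the S3 route, the pattern stays the same, just against an object store; a minimal sketch with boto3 (the bucket name is a placeholder, and credentials are assumed to be configured in the environment):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-app-uploads"  # hypothetical bucket name

    def store_upload(upload_id, local_path):
        # Web server: push the uploaded file to S3 instead of a shared filesystem.
        s3.upload_file(local_path, BUCKET, upload_id)

    def fetch_for_processing(upload_id, local_path):
        # Worker: pull the object down when it is time to process it.
        s3.download_file(BUCKET, upload_id, local_path)

    def delete_when_irrelevant(upload_id):
        # Worker: remove the object once it is no longer needed.
        s3.delete_object(Bucket=BUCKET, Key=upload_id)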

In your case you could also store the files on the web servers in a protected folder, and have the workers fetch them via HTTP when they want to process them. A database table would hold the information about each file and its location.
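A minimal sketch of that worker side, assuming the requests library and a hypothetical files table that records each file's id and the URL of the web server holding it:

    import requests

    # Hypothetical row from a `files` table: which web server holds the upload
    # in its protected folder, and where.
    file_row = {"id": 42, "url": "https://web1.internal/protected/uploads/42"}

    def process(path):
        """Placeholder for the real background work."""

    def fetch_and_process(row):
        # The worker pulls the file over HTTP from the web server that stored it,
        # saves a local copy, and then works on that copy.
        resp = requests.get(row["url"], timeout=30)
        resp.raise_for_status()
        local_path = f"/tmp/upload-{row['id']}"
        with open(local_path, "wb") as f:
            f.write(resp.content)
        process(local_path)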