FTP – How to share FTP'd files between multiple instances on AWS

amazon-ec2, amazon-web-services, ftp

I currently have a system where a client FTPs me files, which fires an inotify event (via Linux kernel notifications) that triggers a parser to act on those files. The problem I'm running into is that the parser is currently hitting I/O capacity on one EC2 instance, and I'd like to add additional nodes to split the file load. The client, unfortunately, can only upload via FTP. This leaves me wondering how I can have another instance, one that doesn't share the EBS volume the files are dropped on, work on those files.
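
For context, the watcher side is roughly this shape (a simplified sketch; /data/incoming and parse.sh are stand-ins for my real paths and parser):

# Requires inotify-tools; close_write fires once the upload has finished writing.
inotifywait -m -e close_write --format '%w%f' /data/incoming |
while read -r file; do
    ./parse.sh "$file" &   # parse each new upload as it lands
done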

Is there an AWS solution that will leave my client's FTP workflow untouched (aside from perhaps an IP change, which is fine) and allow multiple EC2 instances to access the same filesystem?

Best Answer

Of course...

You can attach and stripe across multiple volumes of either type to increase the I/O performance available to your Amazon EC2 applications.

http://aws.amazon.com/ebs/

That's one thing I have used, RAID-10 of EBS volumes... but I assume you've thought of that one.
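
For completeness, that's just mdadm across several attached volumes; a minimal sketch, assuming four volumes attached as /dev/xvdf through /dev/xvdi (device names and volume count are yours to pick):

# Build a RAID-10 array from four EBS volumes, then put a filesystem on it
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
mkfs.ext4 /dev/md0
mount /dev/md0 /data

But that only raises the I/O ceiling of the one instance; it doesn't get a second instance at the files, which is the real question.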

I thought about suggesting scaling your FTP server using something like HAProxy and/or the redir utility that's bundled with Ubuntu (which can rewrite FTP packets to fix some of the inherent absurdity in that protocol) but the awkward multiple-connection nature of FTP could make that a complicated proposition, and it might not really be what you want.

So, what about s3fs?

Before I suggested this, I googled and found things like this post, which suggested it might not work. But then I realized the OP in that case seems to have had a significant misunderstanding of how S3 and filesystems actually work, and was expecting inotify to notice that things had changed remotely in S3 via external causes (having never traversed the local filesystem), which of course makes no sense.

But I wrote some code to test it, and s3fs does indeed appear to interact correctly with inotify. You could mount a bucket, instead of an EBS volume, from your FTP server, so that when your client uploads the files via FTP, they drop directly into the bucket -- and inotify catches that event as it would with a traditional filesystem, at which point you could use SQS or any number of other mechanisms to alert the worker machines that there were jobs to be done. They could then fetch and process the files independently, with I/O being limited only by the available bandwidth between each of those machines and the S3 infrastructure.
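
A rough sketch of that on the FTP server, assuming a bucket called my-ftp-dropbox, credentials in /etc/passwd-s3fs, and an SQS queue named ftp-jobs (all of those names are placeholders, not anything you have to use):

# Mount the bucket where the FTP server drops uploads
s3fs my-ftp-dropbox /data/incoming -o passwd_file=/etc/passwd-s3fs -o allow_other

# One SQS message per completed upload; workers poll the queue and fetch the object
inotifywait -m -e close_write --format '%w%f' /data/incoming |
while read -r file; do
    aws sqs send-message \
        --queue-url "https://sqs.us-east-1.amazonaws.com/123456789012/ftp-jobs" \
        --message-body "${file#/data/incoming/}"   # object key relative to the mount
done

Each worker then pulls the object directly from S3 (via its own s3fs mount, or just the API), so the FTP box never has to serve the file contents itself.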

There are a number of things that s3fs is entirely inappropriate for, such as a server that's serving up the same static content over and over. s3fs is not a good solution for that, because of the large number of redundant requests that would be likely to occur and/or the need for s3fs to cache things locally (which it can, but there's no point -- if you need that, you'd just store the files locally), and the latency involved in fetching files individually on demand while trying to serve up a responsive web site could be problematic... but when each file is not being accessed over and over again, I've had positive results with it.

I recently did a small project for a client where they wanted to store publicly-downloadable assets in S3 but they had perhaps a similar constraint to yours -- they really wanted to be able to upload the files using FTP. Combining proftpd with a bucket mounted to an EC2 instance via s3fs provided them with an easy "gateway" into S3 that was compatible with their existing systems... so I do know that it does work, and having tested that same setup with inotify just now, I can tell you that the two seem to have the expected interaction.
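
The wiring for that gateway was not much more than the mount plus a chrooted FTP home. Something like this (the fstab line is from memory and its exact form varies a bit between s3fs versions; the bucket name and paths are placeholders):

# /etc/fstab -- mount the bucket at boot under the FTP user's home directory
s3fs#my-ftp-dropbox /home/ftpclient/uploads fuse _netdev,allow_other,passwd_file=/etc/passwd-s3fs 0 0

# /etc/proftpd/proftpd.conf -- keep the client confined to their home directory
DefaultRoot ~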

Using S3 from inside EC2 like this, the storage price is essentially equivalent to EBS and you would not pay for bandwidth if the bucket is in the same region as your endpoint -- you'd only pay for each PUT ($5 per million requests) and GET ($4 per million requests) (my interpretation of the pricing tables; I have millions of objects stored in S3 and have never had a billing surprise, but don't take my word for it). There will possibly not be precisely a 1:1 correlation of files and requests, since s3fs has to do some background monkey-business to store the file mode and ownership in S3 as part of its pseudo-filesystem emulation, and has to iterate objects to generate directory listings, so YMMV on the requests... but it seems like a viable solution.

As long as you go into it with the proper understanding of the impedance mismatch between what S3 does and what a traditional filesystem does, I don't see why this wouldn't scale pretty much as infinitely as you need it to.

Of course my favorite part of s3fs is that you never run out of space. :)

Filesystem      Size  Used Avail Use% Mounted on
s3fs            256T     0  256T   0% /var/xxxxxxxxxxx