AWS EC2 – Shared File Systems Between Multiple Instances

amazon ec2, amazon-web-services

I have a couple of Windows Server instances running on Amazon EC2 and would like to make them a bit more fault tolerant by running a duplicate instance with load balancers.

The problem is the instance-specific data. For example, it does no good to fail over from one web server to another if the contents of the document root, i.e. C:/htdocs/ (Apache) or C:/Repositories (VisualSVN Server), are not identical.

Is there a way to share a volume across two or more instances?

My idea is to share a folder between EC2 instances:

[Image: diagram of a folder shared between EC2 instances]

I've read that it's not possible to attach the same EBS volume to multiple instances. I also believe AWS is not NFS-friendly, in case I wanted to share the data over NFS.

And finally, I've also looked at mounting an S3 bucket with s3fs, but I found that it's not a good option either.

Can anyone help point me in the right direction?

Best Answer

It is not possible to share a single EBS volume between multiple EC2 instances.

Your diagram is offloading the data to a shared server. However, this shared server is simply another single-point-of-failure. So you're not saving yourself anything: if the AZ of that server goes down, then you've lost the data, even if the web server/VisualSVN server in another AZ is still running.

You should split your server's two separate functions into two separate servers/clusters so they can be handled independently of each other:

web server, and
VisualSVN server

For the web server, do you really need to mirror the volume in a multi-instance scenario, or can you keep your instances anytime-terminatable without data loss? Ideally, you would not save any data locally to the instance. Instead, you would save all data off-server to a database or to Amazon S3. This way, the data is available to all instances, all the time. If the server is lost, none of the data is. Make your "master" AMI and create all instances in an auto-scaling group from that master AMI. When your web server code changes, deploy a new AMI, terminate the old instances and create new ones from the new AMI.
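To illustrate the "no data on the instance" idea, here is a minimal sketch in Python. The `ObjectStore` interface and the `InMemoryStore` class are hypothetical stand-ins for S3 or a database, not a real AWS API. The only point is that every instance reads and writes through the same external store, so any one instance can be terminated without losing data.

```python
from typing import Dict, Protocol


class ObjectStore(Protocol):
    """Hypothetical interface for off-instance storage (S3, a database, ...)."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for the shared store, used only so the sketch runs locally."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class WebInstance:
    """A web server instance that keeps no state of its own."""

    def __init__(self, name: str, store: ObjectStore) -> None:
        self.name = name
        self.store = store

    def handle_upload(self, key: str, data: bytes) -> None:
        # Written to the shared store, never to the instance's local disk.
        self.store.put(key, data)

    def handle_download(self, key: str) -> bytes:
        return self.store.get(key)


shared = InMemoryStore()
a = WebInstance("web-a", shared)
b = WebInstance("web-b", shared)

a.handle_upload("docs/report.txt", b"hello")
del a  # simulate losing the instance that took the upload
print(b.handle_download("docs/report.txt"))  # still reachable via web-b
```

In a real deployment the store would be backed by S3 or a database client, and the instance names here are made up for the example.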

For the VisualSVN server, the question to ask is whether VisualSVN can handle volume data changing underneath it without the running process caring about it. For example, if the running process caches some data in RAM, what happens if some hard drive sync process comes along behind its back and changes the data on disk? It could be that the VisualSVN server simply is not able to handle a multi-instance scenario. Depending on the answer to that, you may not be able to cluster the VisualSVN server. It's possible that VisualSVN server has its own clustering feature. If so, then you should investigate that.

Related Solutions

Linux – AWS-EC2: Shared file system for high availability

As cyberx86 says, EBS volumes cannot be mounted on multiple EC2 instances (even in the same Availability Zone).

The first answer should be to store your shared assets in Amazon S3 - that way, you can deploy your code via Capistrano/Mcollective/whatever directly to both your live and standby EC2 instances, and completely offload your static content (i.e. images, media) to S3, perhaps even with CloudFront providing edge caching.

That said, S3 doesn't do cross-region replication (e.g. EU-West-1 to US-East-1). It does, however, offer "four nines" (99.99%) of availability within a region, so a full region-wide failure is unlikely. For a 'belt and braces' approach you may want to configure a cron'd synchronisation process between S3 buckets in two different regions; take a look at the s3cmd documentation for its sync command.

If porting your assets to S3 is too much of a pain, and if your failover mechanism is hot-standby (i.e. you have an always-ready clone in another region and manual failover), you can configure a cron'd rsync run to keep your non-version-controlled assets synchronised (as before, you should always release your app code to all servers).
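For readers unfamiliar with what such a sync run does, here is a rough Python sketch of the compare-then-copy idea. It is not how rsync or s3cmd are implemented (rsync works over the network and uses its own change detection and delta transfer). The function names are invented for illustration, and local directories stand in for the live and standby servers.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def sync_one_way(source: Path, target: Path) -> List[str]:
    """Copy new or changed files from source to target; return what was copied.

    Files that exist only in target are left alone (nothing is deleted).
    """
    copied = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = target / rel
        if dst.is_file() and file_digest(dst) == file_digest(src):
            continue  # already identical, nothing to transfer
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(rel.as_posix())
    return copied
```

Running it twice in a row copies nothing the second time, which is the property that makes a cron'd schedule cheap when little has changed.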

Clustered filesystems (e.g. GlusterFS, GFS2) or block-level replication (e.g. DRBD) aren't really recommended on EC2 (or at least not unless you shell out for instances with guaranteed NIC bandwidth such as the cluster networking range). S3FS has proven to be painfully slow, due to the fact that every IO request on the filesystem has to be backed by an S3 API call - details here: (1), (2).

You can run into network congestion caused by other tenants (or even create congestion yourself) - these types of solutions are best fitted to environments where you control (or at least have influence over) the entire stack.

Ftp – How to share FTP’d files between multiple instances on AWS

Of course...

You can attach and stripe across multiple volumes of either type to increase the I/O performance available to your Amazon EC2 applications.

-- http://aws.amazon.com/ebs/

That's one thing I have used, RAID-10 of EBS volumes, but I assume you've already thought of that one.

I thought about suggesting scaling your FTP server using something like HAProxy and/or the redir utility that's bundled with Ubuntu (which can rewrite FTP packets to fix some of the inherent absurdity in that protocol) but the awkward multiple-connection nature of FTP could make that a complicated proposition, and it might not really be what you want.

So, what about s3fs?

Before suggesting this, I googled and found things like this post, which suggested it might not work. But then I realized the OP in that case seems to have had a significant lack of understanding of how S3 and filesystems actually work, and was expecting inotify to notice that things had changed remotely in S3 via external causes (without having traversed the local filesystem), which of course makes no sense.

But I wrote some code to test it, and s3fs does indeed appear to interact correctly with inotify. You could mount a bucket, instead of an EBS volume, from your FTP server, so that when your client uploads the files via FTP, they drop directly into the bucket -- and inotify catches that event as it would with a traditional filesystem, at which point you could use SQS or any number of other mechanisms to alert the worker machines that there were jobs to be done. They could then fetch and process the files independently, with I/O being limited only by the available bandwidth between each of those machines and the S3 infrastructure.
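The flow described above (upload lands in the mounted directory, a watcher notices it, workers get told) can be sketched roughly as follows. This is not the test code mentioned in the answer. To keep it runnable with only the standard library, a polling scan stands in for inotify and `queue.Queue` stands in for SQS; the function names are made up for the example.

```python
import queue
from pathlib import Path
from typing import Set


def scan_for_new_files(upload_dir: Path, seen: Set[str], jobs: "queue.Queue[str]") -> int:
    """Enqueue every file in upload_dir not seen before; return how many were added.

    Polling stands in for inotify here, and queue.Queue stands in for SQS.
    """
    added = 0
    for path in sorted(upload_dir.iterdir()):
        if path.is_file() and path.name not in seen:
            seen.add(path.name)
            jobs.put(path.name)  # a real setup would send a message to the job queue
            added += 1
    return added


def worker_take_job(jobs: "queue.Queue[str]") -> str:
    """A worker pulls the next file name; it would then fetch that object itself."""
    return jobs.get_nowait()
```

Only the file name travels through the queue. Each worker then fetches the object on its own, which is what keeps the FTP server out of the data path. A production watcher would also need to make sure an upload has finished before announcing it.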

There are a number of things that s3fs is entirely inappropriate for, such as a server that's serving up the same static content over and over. s3fs is not a good solution for that, because of the large number of redundant requests that would be likely to occur and/or the need for s3fs to cache things locally (which it can, but there's no point: if you need that, then you'd just store the files locally), and the latency involved in fetching them individually on demand while trying to serve up a responsive web site could be problematic. But when each file is not being accessed over and over again, I've had positive results with it.

I recently did a small project for a client where they wanted to store publicly-downloadable assets in S3 but they had perhaps a similar constraint to yours -- they really wanted to be able to upload the files using FTP. Combining proftpd with a bucket mounted to an EC2 instance via s3fs provided them with an easy "gateway" into S3 that was compatible with their existing systems... so I do know that it does work, and having tested that same setup with inotify just now, I can tell you that the two seem to have the expected interaction.

Using S3 from inside EC2 like this, the storage price is essentially equivalent to EBS and you would not pay for bandwidth if the bucket is in the same region as your endpoint -- you'd only pay for each PUT ($5 per million requests) and GET ($4 per million requests) (my interpretation of the pricing tables; I have millions of objects stored in S3 and have never had a billing surprise, but don't take my word for it). There will possibly not be precisely a 1:1 correlation of files and requests, since s3fs has to do some background monkey-business to store the file mode and ownership in S3 as part of its pseudo-filesystem emulation, and has to iterate objects to generate directory listings, so YMMV on the requests... but it seems like a viable solution.
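As a back-of-the-envelope check on the request charges, here is a small Python sketch using the per-million prices quoted above. Those prices may well be out of date, and the workload numbers in the example are invented. Since s3fs issues extra requests of its own, treat the result as a lower bound.

```python
from decimal import Decimal

# Prices quoted in the answer above; they may be out of date.
PUT_PRICE_PER_MILLION = Decimal("5")
GET_PRICE_PER_MILLION = Decimal("4")
MILLION = Decimal(1_000_000)


def request_cost(puts: int, gets: int) -> Decimal:
    """Estimated request charges in dollars for the given PUT and GET counts."""
    return (puts * PUT_PRICE_PER_MILLION + gets * GET_PRICE_PER_MILLION) / MILLION


# Hypothetical month: 200,000 uploads, each fetched once by three workers.
print(request_cost(puts=200_000, gets=600_000))  # 3.4
```

So at these rates, request charges stay in the range of a few dollars even for hundreds of thousands of files.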

As long as you go into it with a proper understanding of the impedance mismatch between what S3 does and what a traditional filesystem does, I don't see why this wouldn't scale pretty much as infinitely as you need it to.

Of course my favorite part of s3fs is that you never run out of space. :)

Filesystem      Size  Used Avail Use% Mounted on
s3fs            256T     0  256T   0% /var/xxxxxxxxxxx
