Amazon Web Services – Deploy AWS Without Deleting Files

amazon-ec2 amazon-web-services

I'm new to AWS and I'm having trouble deploying a project. Every time I deploy using the CLI, all the files created by my application are wiped.

Now I'm sure I'm just falling victim to my own incompetence, but I'm having a hard time tracking down the right process/design to ensure that I can keep some parts of my data while updating. I suspect I have to save the data to another location outside of the local server, but I'm not clear on how to approach that.

Any pointers would be appreciated.

Best Answer

Generally it's better to treat your instances as temporary, as this makes scaling, backups, etc. simpler. "Cattle, not pets" is the general principle.

To enable this approach put your data onto a persistent data store such as:

EFS (which you can mount as a drive from all your EC2 instances)
Shared EBS volume (an option, though EFS is typically better)
RDS SQL database
DynamoDB NoSQL database
S3 object store (which you can map as a drive with the right software)
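
As a rough illustration of the S3 option, here is a minimal sketch of writing application files to a bucket instead of the instance's local disk, so a redeploy doesn't wipe them. It assumes boto3 is installed and credentials are configured; the helper names (`save_file`, `load_file`, `make_client`) and the bucket and key in the usage note are hypothetical.

```python
def save_file(client, bucket, key, data):
    """Store bytes in S3 rather than on the instance's local disk."""
    client.put_object(Bucket=bucket, Key=key, Body=data)


def load_file(client, bucket, key):
    """Read the bytes back; works from any instance that can reach the bucket."""
    response = client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


def make_client():
    """Create a real S3 client (assumes boto3 and AWS credentials are set up)."""
    # Imported lazily so the helpers above can be tried without boto3 installed.
    import boto3
    return boto3.client("s3")
```

Usage would look something like `save_file(make_client(), "my-app-data", "uploads/report.csv", b"...")`, where `my-app-data` is a placeholder bucket name. Because the data lives outside the instance, replacing the instance on deploy leaves it untouched.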

Related Solutions

PostgresQL on Amazon EBS volume, realistic performance, or move to something more lightweight

@Gnanam's link points to some good advice, particularly this description of a working setup. I see no reason to avoid using EBS, but treat an EBS volume as you would a single hard drive in a real server: prone to failure. Thus, you'll want a RAID level with good resistance to failure, so not RAID 0. And given your requirements, you want a RAID level that's also fast on write. So RAID 10 across 6-10 volumes seems like the best place to start.

As for actual performance, it's going to depend on your indexing requirements and the size and type of data you're inserting. The great thing about AWS is that it's relatively cheap to find out how a certain configuration will perform. So what you'll need to do is come up with some sample data and a way to simulate the incoming feed you're trying to process (a script that inserts the records one at a time and writes a log statement with a timestamp every X rows, for example). It's probably okay for your purposes if the sample data repeats over time, but make sure your script can run for at least an hour.
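
A load script of the kind described could be sketched roughly as follows. This is only an outline: `insert_row` is a hypothetical callable you would supply (for example, a function that executes one INSERT through your PostgreSQL driver), and `run_feed`, `log_every`, `clock` and `log` are names made up for the sketch.

```python
import time


def run_feed(insert_row, rows, log_every=1000, clock=time.monotonic, log=print):
    """Insert rows one at a time, logging elapsed time every `log_every` rows.

    Returns a list of (rows_inserted_so_far, seconds_elapsed) checkpoints.
    """
    start = clock()
    checkpoints = []
    for count, row in enumerate(rows, start=1):
        insert_row(row)  # one record per call, as in the simulated feed
        if count % log_every == 0:
            elapsed = clock() - start
            checkpoints.append((count, elapsed))
            log(f"{count} rows inserted after {elapsed:.1f}s")
    return checkpoints
```

To keep it running for an hour or more with repeating sample data, `rows` could be something like `itertools.cycle(sample_rows)` wrapped in a time limit.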

Now, run that script against a PostgreSQL database set up on various EBS configurations, using snapshots or Amazon's new CloudFormation service to produce reliably reproducible starting points, and measure how performance changes as you change the configuration (how it changes over time will be important as well). You might want to toss in single-volume and RAID 5 configurations just to compare.

Most efficient (time, cost) way to scrape 5 million web pages

Working on the assumption that download time (and therefore bandwidth usage) is your limiting factor, I would make the following suggestions:

Firstly, choose m1.large instances. Of the three 'levels' of I/O performance (which includes bandwidth), the m1.large and m1.xlarge instances both offer 'high' I/O performance. Since your task is not CPU bound, the least expensive of these will be the preferable choice.

Secondly, your instance will be able to download far faster than any site can serve pages, so do not download a single page at a time on a given instance. Run the task concurrently: you should be able to fetch at least 20 pages simultaneously (although I would guess you can probably do 50-100 without difficulty). Take the example from your comment of downloading from a forum: that is a dynamic page that takes the server time to generate, and other users are using that site's bandwidth as well. Continue to increase the concurrency until you reach the limits of the instance bandwidth. (Of course, don't make multiple simultaneous requests to the same site.)
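
One way this could look in practice is a thread pool with a per-host lock, so many pages download at once but never two from the same site. This is a sketch using only the standard library; `download_all`, `fetch_url` and the default of 20 workers are illustrative choices, not a tuned implementation.

```python
import threading
import urllib.request
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def fetch_url(url, timeout=30):
    """Download one page and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch_url, max_workers=20):
    """Fetch URLs concurrently, allowing at most one request per host at a time.

    Returns a list of (url, body, error) tuples in the same order as `urls`.
    """
    host_locks = defaultdict(threading.Lock)
    registry_lock = threading.Lock()  # protects creation of per-host locks

    def worker(url):
        host = urlsplit(url).netloc
        with registry_lock:
            lock = host_locks[host]
        with lock:  # serialize requests to the same site
            try:
                return url, fetch(url), None
            except Exception as exc:  # record the failure and keep going
                return url, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))
```

Note that a worker waiting on a busy host still occupies a pool slot, so shuffling the URL list so that consecutive URLs come from different sites helps keep all workers busy. Raise `max_workers` gradually while watching instance bandwidth.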

If you really are trying to maximize performance, you may consider launching instances in geographically appropriate zones to minimize latency (but that would require geolocating all your URLs, which may not be practical).

One thing to note is that instance bandwidth is variable: at times you will get higher performance, and at other times lower. On the smaller instances, the variation in performance is more significant because the physical links are shared by more servers, and any of those can decrease your available bandwidth. Between m1.large instances, within the EC2 network (same availability zone), you should get near theoretical gigabit throughput.

In general, with AWS, it is almost always more efficient to go with a larger instance as opposed to multiple smaller instances (unless you are specifically looking at something such as failover, etc. where you need multiple instances).

I don't know what your setup entails, but when I have previously attempted this (between 1 and 2 million links, updated periodically), my approach was to maintain a database of the links, adding new links as they were found, and to fork processes to scrape and parse the pages. A URL would be retrieved (at random) and marked as in progress in the database. The script would download the page and, if successful, mark the URL as downloaded in the database and send the content to another script that parsed the page; new links were added to the database as they were found. The advantage of the database here was centralization: multiple scripts could query the database simultaneously and (as long as transactions were atomic) one could be assured that each page would only be downloaded once.
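
The claim-and-mark pattern might be sketched like this. SQLite is used here only as a self-contained stand-in for whatever central database the scrapers share; the table layout, the status values and the function names are all assumptions made for the illustration.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (and if needed create) the URL table."""
    # isolation_level=None: autocommit mode, transactions are managed explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def add_url(conn, url):
    """Add a newly discovered link; links already known are ignored."""
    conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))


def claim_url(conn):
    """Atomically pick a random pending URL and mark it in progress.

    Returns the URL, or None when nothing is pending.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT url FROM urls WHERE status = 'pending'"
            " ORDER BY RANDOM() LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE urls SET status = 'in_progress' WHERE url = ?", (row[0],)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row else None


def mark_downloaded(conn, url):
    """Record that the page was fetched successfully."""
    conn.execute("UPDATE urls SET status = 'downloaded' WHERE url = ?", (url,))
```

Because the select and the update happen inside one transaction that holds the write lock, two workers cannot claim the same URL. With a server database shared across instances the same idea applies, though the locking syntax would differ.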

A couple of additional points worth mentioning: there are limits (I believe 20) on the number of on-demand instances you can have running at one time. If you plan to exceed those limits, you will need to ask AWS to increase your account's limits. It would be much more economical to run spot instances and to scale up your numbers when the spot price is low (maybe one on-demand instance to keep everything organized, with the rest as spot instances).

If time is of higher priority than cost to you, the cluster compute instances offer 10Gbps bandwidth - and should yield the greatest download bandwidth.

Recap: try a few large instances (instead of many small instances) and run multiple concurrent downloads on each instance - add more instances if you find yourself bandwidth limited, move to larger instances if you find yourself CPU/memory bound.
