Linux – Amazon EC2 + S3 + Python + Scraping – The cheapest way of doing this

amazon-ec2, amazon-web-services, linux, python, scraping

I have tapped into Amazon's AWS offerings – please tell me, at a high level, whether I am thinking about this correctly.

So I have a few Python scraping scripts on my local machine. I want to use AWS for super-fast internet connectivity and a cheaper price – win/win!

  • I understand that I can deploy a CentOS/Ubuntu instance on EC2,
    install the necessary Python libraries, and start and stop instances
    using boto (Python) to save costs. Am I thinking right so far? (Is it
    feasible? See the sketch after this list.)

  • I will cron some scripts that will start fetching (scraping) HTML
    files for parsing later on. These HTML files would then be copied over
    to S3 for storage (or should I dump them to my local machine, since
    that is how I will be parsing them and storing the results in MySQL?).

Please advise whether I make any sense with my assumptions, given the little knowledge of AWS I have from a few hours of reading/Googling about the service.

Best Answer

The basic premise of your setup seems fine; however, there are a few items that you may want to factor in.

Firstly, EC2 network (and I/O) bandwidth is dependent on instance type. If you are hoping to use t1.micro instances, do not expect 'super fast internet connectivity' - even with an m1.small, you may not see the performance you are looking for. Also, keep in mind that you pay for bandwidth used on EC2 (and not just for instance time).

With regard to your first point, there should be no real difficulty in setting up Python on an EC2 instance. The potential difficulty arises from coordinating your instances. For example, if you have 2 instances running, how will you split the task between them? How will each instance 'know' what the other has done (presuming you aren't going to manually partition a list of URLs)? Moreover, if you are launching a new instance, will one of the EC2 instances be responsible for handling that, or will your local machine deal with it? If it is one of the EC2 instances, how do you determine which instance will be responsible for the task (i.e. to prevent the 'launch' task being executed by every instance), and how do you redistribute the tasks to include the new instance? How do you determine which instance(s) to terminate automatically?
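
If you do end up needing one instance to act as the 'coordinator' (e.g. the only one allowed to launch or terminate others), one simple convention is to let the running instance with the smallest instance ID take that role. A rough sketch, assuming boto3, the EC2 instance-metadata endpoint, and a made-up 'role=scraper' tag on your scraper instances:

```python
# Hedged sketch: elect the instance with the lexicographically smallest
# instance ID among running, tagged scraper instances as the coordinator.
# The 'role=scraper' tag is an assumption, not something AWS provides.
import urllib.request
import boto3

def my_instance_id():
    # Instance metadata endpoint, reachable from inside any EC2 instance
    url = "http://169.254.169.254/latest/meta-data/instance-id"
    return urllib.request.urlopen(url, timeout=2).read().decode()

def i_am_coordinator():
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:role", "Values": ["scraper"]},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )
    ids = [inst["InstanceId"]
           for res in resp["Reservations"]
           for inst in res["Instances"]]
    return bool(ids) and min(ids) == my_instance_id()

if __name__ == "__main__" and i_am_coordinator():
    pass  # only this instance decides whether to launch/terminate workers
```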

Undoubtedly, all of the above are possible (corosync/heartbeat, pacemaker, auto-scaling, etc.) but easy to overlook initially. Regardless, if you are looking for the 'best price' you will probably want to go with spot instances (as opposed to on-demand); however, for that to work, you do need a fairly robust architecture. (It is worth noting that the spot price fluctuates significantly - at times exceeding the on-demand price; depending on the time-scale over which you are working, you will either want to set a low upper spot price, or determine the best approach (spot/on-demand) on a regular (hourly) basis to minimize your costs.) Although I can't confirm it at the moment, the simplest (and cheapest) option may be AWS' auto-scaling. You need to set up CloudWatch alarms (but CloudWatch does provide 10 free alarms), and auto-scaling itself does not have a cost associated with it (other than the cost of the new instances and the CloudWatch costs).
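
If you do go down the 'decide spot vs. on-demand every hour' route, the check itself is simple. A rough sketch, assuming boto3, a placeholder instance type, and a hard-coded on-demand price that you would look up yourself for your type/region:

```python
# Hedged sketch of an hourly spot-vs-on-demand decision. The on-demand
# price and instance type below are assumptions, not real quotes.
from datetime import datetime, timedelta
import boto3

ON_DEMAND_PRICE = 0.08      # assumed $/hour for your instance type
INSTANCE_TYPE = "m1.small"  # placeholder

def current_spot_price():
    ec2 = boto3.client("ec2")
    history = ec2.describe_spot_price_history(
        InstanceTypes=[INSTANCE_TYPE],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=datetime.utcnow() - timedelta(hours=1),
    )
    prices = [float(p["SpotPrice"]) for p in history["SpotPriceHistory"]]
    return min(prices) if prices else None

def use_spot():
    # Prefer spot only while it undercuts the on-demand price
    price = current_spot_price()
    return price is not None and price < ON_DEMAND_PRICE

if __name__ == "__main__":
    print("spot" if use_spot() else "on-demand")
```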

Given that I really have no idea of the scope of your undertaking, I might ask why not simply use EC2 for the parsing and processing as well. Especially if the parsing is complex, the pages can be fetched faster than they can be processed, and you have a large number of pages (presumably, otherwise you wouldn't be going through the effort of setting up AWS), it might be more efficient to simply process the pages on EC2 and, when everything is done, download a dump of the database. Arguably, this might simplify things a bit - have one instance running MySQL (with the data stored on an EBS volume); each instance queries the MySQL instance for the next set of records (and perhaps marks those as reserved), fetches and processes them, and saves the data to MySQL.
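
To make the 'reserve the next set of records' idea concrete, here is a minimal sketch assuming a shared MySQL instance, a hypothetical `pages` table (url, status, worker columns), and the pymysql driver - not your schema, just one way to stop two workers grabbing the same URLs:

```python
# Hedged sketch: each worker atomically reserves a batch of pending URLs
# in a shared MySQL table, then fetches only the rows it reserved.
import socket
import pymysql

conn = pymysql.connect(host="mysql-instance", user="scraper",
                       password="REPLACE_ME", database="scrape",
                       autocommit=False)

def claim_batch(batch_size=50):
    worker = socket.gethostname()
    with conn.cursor() as cur:
        # Mark a batch as reserved by this worker in one statement...
        cur.execute(
            "UPDATE pages SET status = 'reserved', worker = %s "
            "WHERE status = 'pending' LIMIT %s",
            (worker, batch_size),
        )
        conn.commit()
        # ...then read back exactly the rows this worker reserved.
        cur.execute(
            "SELECT url FROM pages WHERE status = 'reserved' AND worker = %s",
            (worker,),
        )
        return [row[0] for row in cur.fetchall()]
```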

If you are not going to run MySQL on EC2, you can either store your HTML files on S3, as you have mentioned, or save them on an EBS volume.

The advantage of S3 is that you don't need to pre-allocate storage (especially useful if you don't know the size of the data you are dealing with) - you pay for PUTs/GETs and storage. The downside is speed - S3 is not meant to be used as a filesystem, and (even though you can mount it as a filesystem) it would be fairly inefficient to save each individual file to S3 (you will want to accumulate a few pages and then upload them to S3). Additionally, if you have a large volume of files (tens of thousands), the process of fetching all the filenames, etc. can be slow.

EBS volumes are meant to be used as storage attached to an instance - the advantage is speed, both in transfer rates and the fact that there is a 'filesystem' (so reading a list of files, etc. is quick) - and EBS volumes persist beyond instance termination (except for EBS root volumes, which do not by default, but can be made to). The downsides of EBS volumes are that you have to pre-allocate a quantity of storage (which cannot be modified on the fly) and you pay for that amount of storage (regardless of whether all of it is in use); you also pay for I/O operations (also, the performance of EBS volumes is dependent on network speed - so larger instances get better EBS performance). The other advantage of EBS is that, being a filesystem, you can perform a task like gzipping the files very easily (and I imagine that if you are downloading a lot of HTML pages, you will not want to be fetching individual files off S3 later on).
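
If you do go the S3 route, a minimal sketch of the 'accumulate, compress, upload in one go' approach (assuming boto3 and a placeholder bucket name) could look like this:

```python
# Hedged sketch: bundle a batch of fetched HTML files into one gzipped tar
# archive and upload it as a single S3 object, rather than one PUT per page.
# Bucket name and key prefix are placeholders.
import tarfile
import time
import boto3

s3 = boto3.client("s3")
BUCKET = "my-scrape-bucket"  # placeholder

def upload_batch(html_paths):
    # Bundle this batch of HTML files into one compressed archive...
    archive = "/tmp/batch-%d.tar.gz" % int(time.time())
    with tarfile.open(archive, "w:gz") as tar:
        for path in html_paths:
            tar.add(path)
    # ...and pay for a single PUT instead of one per page.
    s3.upload_file(archive, BUCKET, "raw-html/" + archive.split("/")[-1])
```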

I am not really going to speculate on the possibilities (keeping in mind that at a very large scale, something like map-reduce/hadoop would be used to manage this kind of task), but as long as you have an approach for partitioning the task (e.g. MySQL instance) and managing the scaling of instances (e.g. auto-scaling), the idea you have should work fine.