PostgresQL on Amazon EBS volume, realistic performance, or move to something more lightweight

amazon-web-servicesmongodbpostgresql

I'm working on a little research project, currently running as an instance on ec2, and I'm hoping to figure out whether I'm going down the right path. We, like a thousand other people, are making use of some of twitters streaming feeds to do gather some data to have fun with and my db seems to be having problems keeping up, and queries take what seems to be a very long time. I'm not a DBA by trade, so I'll just dump some info here and add more if need be.

System specs:

ec2 xl, 15 gigs of ram

ebs: 4 100 gb drives, raid 0.

The stream we're getting we're looking at around 10k inserts per minute.

3 main tables, with the users we're tracking somewhere in the neighborhood of 26M rows currently.

Is this amount of inserts on this hardware too much to ask out of ebs? Should take a look at some things with less overhead like mongodb?


Best Answer

@Gnanam's link points to some good advice, particularly this description of a working setup. I see no reason to avoid using EBS, but treat an EBS volume as you would a single hard drive in a real server: prone to failure. Thus, you'll want a RAID level with good resistance to failure, so not RAID 0. And given your requirements, you want a RAID level that's also fast on write. So RAID 10 across 6-10 volumes seems like the best place to start.

As for actual performance, it's going to depend on your indexing requirements and the size and type of data you're inserting. The great thing about AWS is that it's relatively cheap to find out how a certain configuration will perform. So what you'll need to do is to come up with some sample data and way to simulate the incoming feed you're trying to process (a script that inserts the records one at a time and writes a log statement with a timestamp every X number of rows, for example). It's probably okay if the sample data repeats over time for your purposes, but make sure your script can run for an hour or more at least.

Now, run that script against a postgresql database set up on various EBS configurations, using snapshotting or Amazon's new Cloud Formation service to produce some reliably reproducible starting points, and measure the performance changes as you change the configuration (and over time will be important as well). You might want to toss in single-volume and RAID5 configurations just to compare.