Web-server – How to estimate IO requirements


Given the following parameters, how can I estimate disk subsystem requirements?

The environment is a public webserver for delivery of large files to many concurrent users.

  • Total file set size (in GB)
  • Total file set size (# files)
  • Working file set size (in GB)
  • Working file set size (# files)
  • Average file size
  • Average client download speed
  • Expected concurrent clients

Despite the fact that these are large files, I imagine I should estimate for somewhere between pure random IO and pure sequential IO. With faster/fewer clients I'd guess it tends towards sequential, and with slower/more clients it tends towards random. Hopefully that's roughly correct?

So my thinking is to first calculate "expected IOPS." This is what I'm stuck on. I'm assuming I should be able to get close using the following parameters: working set size, average client speed, and expected concurrent clients.
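For what it's worth, here's one way that back-of-the-envelope calculation could look. All the numbers below (client count, per-client speed, and especially the 64 KB per-IO read size) are made-up placeholders — the effective IO size in particular depends on your filesystem and read-ahead settings:

```python
# Back-of-the-envelope IOPS estimate; all figures are hypothetical placeholders.
concurrent_clients = 500            # expected concurrent clients
avg_client_speed = 1 * 2**20        # average client download speed: 1 MB/s, in bytes/s
io_size = 64 * 2**10                # assumed average read size per IO (64 KB)

# Aggregate throughput the disks must sustain if nothing is cached:
total_throughput = concurrent_clients * avg_client_speed    # bytes/s

# Pessimistic case (no read-ahead, no cache hits): every IO is an independent seek.
expected_iops = total_throughput / io_size

print(f"Throughput: {total_throughput / 2**20:.0f} MB/s")   # 500 MB/s
print(f"Expected IOPS: {expected_iops:.0f}")                # 8000
```

Note the working set size doesn't appear directly here — it matters mostly for deciding how much of the load RAM caching will absorb, which this pessimistic sketch deliberately ignores.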

From there, I can look at the IOPS ratings of disks and RAID controllers and come up with a rough estimate of the disk subsystem required to serve the file set to that many users.

Obviously there is more to it, such as read-ahead and the amount of RAM available for caching, as well as filesystem block size, RAID stripe width, etc., but I figure if I base it off 0 read-ahead and 0 RAM, that should give me a rough pessimistic estimate.
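Turning that expected-IOPS figure into a spindle count could then be as simple as the sketch below. Both numbers are placeholders — the per-disk rating in particular varies a lot by drive type (roughly 75-100 IOPS for 7.2k SATA, 150-200 for 15k SAS, far more for SSDs), so check vendor specs or benchmarks:

```python
import math

# Placeholder figures: your expected-IOPS number and a per-disk rating.
required_iops = 8000     # whatever your "expected IOPS" calculation produced
per_disk_iops = 180      # ballpark for a 15k RPM disk; check your vendor's specs

disks_needed = math.ceil(required_iops / per_disk_iops)
print(disks_needed)      # 45
```

This ignores RAID write penalties and controller limits, so treat it as a floor, not a design.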

Can someone with experience in this field please let me know if I'm on the right track, and/or offer any advice on how to calculate some of these values?

If there are sites that discuss this or books I can buy, I'm very willing to do so, but I've been searching for two days without much luck. I'm a bit out of my depth when it comes to storage.

I also understand I'll have to benchmark to get a proper answer, but I'd like to do as much estimation as I possibly can first.

All help appreciated, flames welcomed!

Best Answer

One area you seem to have missed in this is expected maximum transfer rate. Also, a sense of how 'noisy' your IOPS curve is. If it is very noisy you can have sustained periods of significantly above-average IOPS, and that's a case you'll need to engineer for. From experience, some of the biggest IOPS bursts occur with large transfers, and if those large transfers saturate your I/O subsystem somehow, other operations during those transfers will suffer.

Peak loads do need to be considered, as you want to perform adequately when they occur. This may mean your system is underutilized a lot of the time, but that sort of comes with the territory. We create a minimum service guarantee over the expected load-range and manageable growth, which leads to a certain amount of unavoidable over-engineering.

Another area is expected read/write I/O percentages. You said web-server, so I'm guessing it'll be more reads than writes but you'd know best. If the percentages are heavily skewed towards reads (say, 80% reads) that'll have an effect on what you select for the storage sub-system as you'll be able to afford expensive writes in order to get fast reads (RAID5 or RAID6 for example). But not too expensive, as you don't want to saturate something with a huge write that'll bog down the whole system.
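To make the read/write trade-off concrete, the usual rule of thumb is a per-write IOPS penalty of 2 for RAID10, 4 for RAID5, and 6 for RAID6. A minimal sketch, with hypothetical disk figures:

```python
# Standard write-penalty rule of thumb: RAID10 = 2, RAID5 = 4, RAID6 = 6.
def effective_iops(raw_iops, read_fraction, write_penalty):
    """Raw array IOPS discounted by the read/write mix and RAID write penalty."""
    write_fraction = 1.0 - read_fraction
    return raw_iops / (read_fraction + write_fraction * write_penalty)

raw = 10 * 180   # hypothetical: 10 disks at ~180 IOPS each
print(f"RAID10 at 80% reads: {effective_iops(raw, 0.8, 2):.0f}")   # 1500
print(f"RAID5  at 80% reads: {effective_iops(raw, 0.8, 4):.0f}")   # 1125
```

At 80% reads the gap between parity RAID and mirroring is modest; as the write fraction grows, parity RAID falls off much faster.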

Once you do get hardware, do test failure modes. Figure out how bad things get when a drive has failed, and when one is added back in. If you only have five disks this may not be a big deal, as the failure rate should be low enough that bad disks are a very infrequent occurrence. But if you have a lot of spindles (say... over 10) your failure rate may be high enough that you have to factor the 'failed' state into your estimations. We got badly bitten by this one a couple of years ago, as a certain drive array seriously bottlenecked on writes when it was rebuilding a parity set (it disabled the write cache, evil, evil thing), which caused havoc when someone attempted to write a CD image (625MB!) to it during this period.

And finally, factor backup load into your estimation. If you're going to have to provide service while the backup is busily reading everything on the server, that'll also have an effect on how beefy a storage system you need. So, consider utility I/O operations, not just user-generated ones.

This should give you a few more data-points to work with!

**edit:** Peak headroom... that depends on load. I have a system that during the day averages between 3-5 MB/s with peaks in the 10-15 MB/s range, and backup can push it to 20-25 MB/s. Average, therefore, is around 12 MB/s, with true peak a bit over twice that. This particular system doesn't suffer significantly during RAID rebuilds so it doesn't enter into planning. Also, end-user driven I/O is minimal during the backup period, so I don't have to worry about contention, which means that I can run it flat out during backups without fear of getting calls.