Accurately trending random I/O performance for capacity planning

capacity-planning performance performance-monitoring virtualization

Where I work we have numerous "big iron" servers which are used for hosting many virtual machines using a Xen hypervisor. These are typically configured with 32GB RAM, dual quad-core processors and fast disks with gobs of I/O capacity.

We're at the point in time where the existing hardware configuration is getting a bit long in the tooth and it is time to go out and source bigger, faster and shinier new hardware.

As mentioned above, the existing kit has been deployed with 32GB RAM and that has effectively limited the number of VMs that we can deploy to a host.

In investigating newer hardware, though, it is evident that you can get far more RAM in a single chassis: 64, 72 or even 96GB. This will allow us to fit more VMs on a given host, which is always a win. Analysis completed so far suggests that the limiting factor will then shift to the disk subsystem.

The problem now is trying to get some idea of where we're at. By virtue of the usage, we know that we're not limited by I/O bandwidth but rather by the number of random I/O operations which can be completed. We know anecdotally that once we hit this point, iowait is going to skyrocket and the entire machine's performance is going to go to the dogs.

Now this is the crux of the question I am asking: is anyone aware of a way to accurately track and trend existing I/O performance, specifically with relation to the number of random I/O ops being completed?

What I am really trying to get a metric on is "this configuration can successfully handle X number of random I/O requests, and we're currently (on average) doing Y ops with a peak of Z ops".

Thanks in advance!

Best Answer

sar does the job nicely here; it'll collect the number of transactions as well as sectors read/written per second, which can then be used to replay your IO workload with reasonable accuracy (in terms of read/write ratios, as well as transaction size, which is the determining factor in how "random" your IO is). It's not perfect, but in my experience it does a good enough job for the sort of estimation you're looking at.
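As a rough illustration of how those sar figures might be boiled down into the "average Y, peak Z" numbers from the question, here is a minimal Python sketch. It assumes you have already pulled the per-interval tps, sectors read/s and sectors written/s values for one device out of the sar output yourself, and that all intervals are the same length. The `Sample` class, the `summarise` function and the example numbers are hypothetical and not part of sar.

```python
from dataclasses import dataclass
from statistics import mean

SECTOR_BYTES = 512  # sar documents its sector counts as 512-byte sectors


@dataclass
class Sample:
    """One sar interval for a single device (values are assumed inputs)."""
    tps: float     # transfers (I/O requests) per second
    rd_sec: float  # sectors read per second
    wr_sec: float  # sectors written per second


def summarise(samples):
    """Reduce equal-length sar intervals to a few capacity-planning figures."""
    if not samples:
        raise ValueError("need at least one sample")
    total_tps = sum(s.tps for s in samples)
    total_rd = sum(s.rd_sec for s in samples)
    total_wr = sum(s.wr_sec for s in samples)
    total_sec = total_rd + total_wr
    return {
        # Y in the question: average ops per second over the period
        "avg_tps": mean(s.tps for s in samples),
        # Z in the question: busiest interval seen
        "peak_tps": max(s.tps for s in samples),
        # Share of data volume that was reads (by sectors, not by op count)
        "read_fraction_by_volume": total_rd / total_sec if total_sec else 0.0,
        # Small average request sizes suggest a more random workload
        "avg_request_kib": (
            total_sec * SECTOR_BYTES / 1024 / total_tps if total_tps else 0.0
        ),
    }


if __name__ == "__main__":
    # Made-up numbers purely for illustration
    demo = [Sample(100, 400, 400), Sample(300, 1600, 800), Sample(200, 400, 1200)]
    print(summarise(demo))
```

The read share and average request size are the inputs you would feed into whatever benchmark tool you use to replay the workload. The ops rate that replay sustains before latency climbs is one way to estimate X, which can then be compared against `avg_tps` and `peak_tps`.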