Are repeatable performance tests possible on a VM?

performance-testing virtual-machine

My application does a lot of database inserts, so disk I/O is a big part of the workload. QA does almost all of its testing on VMs. I'm concerned that tests intended to detect performance regressions won't give valid, repeatable results in a VM environment, since other activity on the physical machine will affect the application's performance.

Is this a legitimate concern, or do modern virtual environments have a way of truly isolating an application's environment which would allow for repeatable performance tests?

I want to put my application and database on a "machine", run a test, and note how long it took (which will be some number of hours or minutes, not seconds or milliseconds). Later in my development cycle, I want to run the same test and check whether performance has regressed due to code changes. When running on a dedicated physical machine, I get reasonably consistent results. My question is: if I run this test on a virtual machine, might I get significant differences in run time due to work being done by other VMs on the same physical box? Is there any way to configure the VM to control for this, considering that disk I/O is a major part of the workload?
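For context, the check I have in mind is roughly the following Python sketch. The command name, baseline time, and tolerance are placeholders I made up for illustration; the point is that the pass/fail signal is a wall-clock comparison, which is exactly what noisy neighbours on a shared host could distort.

```python
import subprocess
import time

# Hypothetical command that drives the insert-heavy test workload.
TEST_COMMAND = ["python", "run_insert_workload.py"]

BASELINE_SECONDS = 3.5 * 3600   # previously recorded run time on the same "machine"
TOLERANCE = 0.15                # flag anything more than 15% slower than baseline

def timed_run():
    """Run the workload once and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    subprocess.run(TEST_COMMAND, check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = timed_run()
    slowdown = (elapsed - BASELINE_SECONDS) / BASELINE_SECONDS
    print(f"Elapsed: {elapsed / 3600:.2f} h "
          f"(baseline {BASELINE_SECONDS / 3600:.2f} h, change {slowdown:+.1%})")
    if slowdown > TOLERANCE:
        print("Possible performance regression: run time exceeded the baseline "
              "by more than the tolerance.")
```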

Best Answer

To address some of the comments first: yes, all the hypervisors worth building a VM environment on can limit resource consumption and also guarantee minimums. You can leverage these settings to establish performance service levels. Determine your application's minimum requirements and translate them into a VM sizing, for example 2 vCPUs and 4 GB RAM, then allocate a VM that cannot be given any more than that and validate that the application's performance stays within the requirements.
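One way to keep that sizing honest is to have the test harness refuse to run unless the guest actually matches the agreed allocation. A minimal sketch, assuming Python with the third-party psutil package and using the 2 vCPU / 4 GB example sizing above:

```python
import os
import psutil  # third-party; pip install psutil

# Agreed VM sizing from the performance service level (example values from above).
EXPECTED_VCPUS = 2
EXPECTED_RAM_BYTES = 4 * 1024**3

def check_vm_sizing():
    """Abort the test run if the VM does not match the agreed sizing."""
    vcpus = os.cpu_count()
    ram = psutil.virtual_memory().total
    if vcpus != EXPECTED_VCPUS:
        raise RuntimeError(f"Expected {EXPECTED_VCPUS} vCPUs, found {vcpus}")
    # Allow some headroom: the guest OS reports slightly less than the allocation.
    if ram < EXPECTED_RAM_BYTES * 0.9:
        raise RuntimeError(f"Expected ~{EXPECTED_RAM_BYTES} bytes of RAM, found {ram}")

check_vm_sizing()
```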

This guide from VMware explains the approach from that product's perspective: https://www.vmware.com/files/pdf/partners/tap/directions-vmware-ready-testing-application-software.pdf

A key point in that guide is to do your performance testing while also monitoring the hypervisor, to make sure its CPU stays below 70% and memory is not so starved that virtual swapping occurs.

A general rule of thumb: when either of those things happens (hypervisor CPU above 70%, or physical memory exhausted on the hypervisor), you can expect to see degraded performance in the VMs.
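You normally can't read the hypervisor's counters from inside the guest, but on Linux guests running under hypervisors that expose it (KVM and Xen do; VMware generally does not report it this way), CPU "steal" time in /proc/stat is a rough proxy for host contention. A hedged sketch of sampling it during a test run:

```python
import time

def read_cpu_times():
    """Return (steal_ticks, total_ticks) from the aggregate 'cpu' line in /proc/stat."""
    with open("/proc/stat") as f:
        fields = f.readline().split()
    values = [int(v) for v in fields[1:]]
    steal = values[7] if len(values) > 7 else 0   # 8th value is 'steal'
    return steal, sum(values)

def steal_percent(interval=5.0):
    """Percentage of CPU time the hypervisor withheld from this guest over the interval."""
    s1, t1 = read_cpu_times()
    time.sleep(interval)
    s2, t2 = read_cpu_times()
    return 100.0 * (s2 - s1) / max(t2 - t1, 1)

if __name__ == "__main__":
    # Sample periodically while the performance test runs; consistently high steal
    # suggests the host is contended and the measured run time is not trustworthy.
    while True:
        print(f"steal: {steal_percent():.1f}%")
```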

You can do the same for the storage and network IO dimensions, but those are driven by the investment made in the underlying architecture. My organization puts VMs on a SAN with multiple 10 Gb uplinks, and the VMs see higher storage IO than they would with local disk.

Ideally, the organization where your app will be deployed has a managed VM infrastructure, with the hypervisors monitored and proactively managed so they are never pushed to the point of slowing down the VMs. If you can, get metrics on the hypervisor to understand its general performance trends, track those metrics while testing, and state your app's requirements on the assumption that the hypervisor will continue to be managed within those same parameters.

If you can't get metrics about the hypervisor's resources and performance, then really all you can do is measure the resources used by your OS and make those the requirements the hypervisor must guarantee.
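A sketch of recording those guest-level numbers while the test runs, again using the third-party psutil package (the sampling interval and which counters you keep are up to you):

```python
import time
import psutil  # third-party; pip install psutil

def sample_guest_usage(duration_s=60, interval_s=5):
    """Periodically record CPU, memory, and disk write rate as seen from inside the guest."""
    samples = []
    last_io = psutil.disk_io_counters()
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        time.sleep(interval_s)
        io = psutil.disk_io_counters()
        samples.append({
            "cpu_percent": psutil.cpu_percent(interval=None),
            "mem_percent": psutil.virtual_memory().percent,
            "write_mb_per_s": (io.write_bytes - last_io.write_bytes) / interval_s / 1e6,
        })
        last_io = io
    return samples

# Run this alongside the performance test; the peak and average values become the
# resource profile you ask the hypervisor team to guarantee.
for s in sample_guest_usage():
    print(s)
```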

Since your concern is mainly around storage IO, this blog focuses on measuring IOPS: https://blog.synology.com/?p=146
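A purpose-built tool such as fio is the usual way to benchmark IOPS, but if you just want a quick, rough number from inside the guest for the kind of synchronous small writes a database commit generates, a sketch like this works (it measures fsync'd 4 KiB writes per second, which is only a crude proxy for a real database workload):

```python
import os
import time

def rough_write_iops(path="iops_test.bin", block_size=4096, seconds=10):
    """Rough synchronous-write IOPS: 4 KiB writes, each fsync'd to stable storage."""
    block = os.urandom(block_size)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    ops = 0
    end = time.monotonic() + seconds
    try:
        while time.monotonic() < end:
            os.write(fd, block)
            os.fsync(fd)        # force each write to disk, roughly like a DB commit
            ops += 1
    finally:
        os.close(fd)
        os.remove(path)
    return ops / seconds

if __name__ == "__main__":
    print(f"~{rough_write_iops():.0f} fsync'd 4 KiB writes per second")
```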