Single-thread app 50% slower on VMware X5650 than physical E5450

vmware-esx  vmware-vsphere

Our application runs 50% slower on a Xeon X5650 under ESX than on a bare-metal E5450.

A test task on the physical servers takes 17 minutes.

The same task on the virtual servers takes 25 minutes: 50% longer.

From everything I can tell, this should be impossible; the 5600 series is supposedly 10-20% faster than the 5400 series for single-threaded processes at the same clock speed, and virtualization overhead should be similarly low for a single-threaded, CPU-bound workload. Performance should at least break even, shouldn't it?
But instead of having equal or better performance, performance is cut by 1/3.


UPDATE: Solved. The same task on the (fixed) virtual servers takes 14 minutes: 15% faster.

It was the RAM configuration. The 50% performance drop happened because the ESX host's system memory was installed incorrectly and was providing only around half of the possible bandwidth. For this CPU- and memory-bound process, that loss of bandwidth translated into 50% worse performance than expected.

Application performance is now smack in the middle of the 10-20% improvement we originally expected.
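
(For anyone hitting something similar: a crude way to sanity-check memory bandwidth from inside the guest or on the host is to time a large array copy. This is only a sketch and assumes Python with NumPy is available; any STREAM-style benchmark will do. A correctly populated box should report several GB/s, and a broken DIMM layout shows up as a roughly halved figure.)

```python
# Rough STREAM-style memory bandwidth probe (sketch only, not a real benchmark).
# Assumes NumPy is installed; a proper tool such as STREAM is more trustworthy.
import time
import numpy as np

N = 200 * 1024 * 1024 // 8          # ~200MB worth of float64 elements
src = np.ones(N)
dst = np.empty_like(src)

best = float("inf")
for _ in range(5):
    t0 = time.time()
    np.copyto(dst, src)             # streams ~200MB in and ~200MB out
    best = min(best, time.time() - t0)

bytes_moved = 2 * src.nbytes        # one read stream plus one write stream
print("approx. copy bandwidth: %.1f GB/s" % (bytes_moved / best / 1e9))
```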


There are two physical Windows Server 2003 R2 systems running an application that consists of a single-threaded calculation process on one server (32-bit), talking back and forth to an SQL Server 2005 database on the other (64-bit).

Both physical boxes are single-CPU E5450 with 4GB RAM @800MHz. The calculation server never uses more than 1.5GB physical memory, and the SQL Server never uses more than 2.5GB. CPU utilization on the calculation server never goes over ~15% (around 50% of a single core). CPU utilization on the DB server never goes above ~25% (a fully-utilized single core).

The physical ESX 4.1 hosts are dual-CPU X5650 with 64GB RAM @1333MHz. The virtual machines are given 4 cores and 4GB RAM each, to mirror the physical environment. The test was run both with a single VM on each physical host and with both VMs on the same host.

Interestingly, we get very nearly the same 25-minute test results on another pair of ESX servers using X5550 CPUs and RAM @1066MHz.

Also, test results on the virtual systems do not vary by more than 10% either way whether the VMs are given 1, 2, or 4 CPUs, or 1, 2, 4, or 8GB RAM. There is very little network or disk activity, and as far as I can tell the process should be CPU-bound.
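
For anyone who wants to verify that CPU-bound assumption, a rough sampling sketch (assuming Python and the psutil package are available on the guest; the process-name substring is a placeholder for the real calculation executable):

```python
# Sample a process's CPU time and disk IO over a fixed window to see whether
# it is CPU-bound. Assumes the psutil package; PROC_NAME is a placeholder.
import time
import psutil

PROC_NAME = "calc"                  # hypothetical: substring of the process name

proc = next(p for p in psutil.process_iter(["name"])
            if PROC_NAME in (p.info["name"] or "").lower())

cpu0 = sum(proc.cpu_times()[:2])    # user + system seconds so far
io0 = proc.io_counters()
t0 = time.time()

time.sleep(30)                      # sampling window

cpu1 = sum(proc.cpu_times()[:2])
io1 = proc.io_counters()
t1 = time.time()

print("CPU use during sample: %.0f%% of one core" % (100 * (cpu1 - cpu0) / (t1 - t0)))
print("disk bytes read/written: %d / %d"
      % (io1.read_bytes - io0.read_bytes, io1.write_bytes - io0.write_bytes))
```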

Tests have been run using both local 15K SAS disks on the separate hosts and a gigabit iSCSI SAN, also with 15K disks. There is negligible difference in results between the two storage options.

From everything I can tell, the Xeon 5600-series should be 20-50% faster than the 5400-series for single-threaded workloads. Even considering that the X5650 is a 2.67GHz part and the E5450 a 3GHz part, if per-core performance were merely equal at the same clock speed you would still expect to see at least 90% of the performance instead of 67%. This doesn't even take into account the fact that the memory clock is nearly twice the speed.
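
The back-of-the-envelope arithmetic behind those numbers, spelled out (plain Python, purely illustrative):

```python
# If per-core performance per clock were identical, the 2.67GHz X5650 should
# finish the 17-minute job in roughly 17 / (2.67 / 3.0) minutes.
physical_minutes = 17.0
clock_ratio = 2.67 / 3.0                              # ~0.89
expected_minutes = physical_minutes / clock_ratio     # ~19 minutes, worst case
observed_minutes = 25.0

print("worst-case expectation: %.1f min" % expected_minutes)
print("observed speed: %.0f%% of physical" % (physical_minutes / observed_minutes * 100))
```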

It should be said that I've done several virtualization projects in the past and never seen anything close to a 50% performance degradation, even when using the SAME physical CPU cores, let alone cores two generations newer with faster memory.

Any ideas on possible causes or any configuration settings I should check?

Best Answer

Depending on the type of virtualization, 5% overhead is pretty much a best-case scenario. With full paravirtualization, you can achieve that kind of overhead on IO-light workloads quite easily. With hardware-assisted virtualization (the technology used by VMware), such low overhead is achievable for IO-light workloads on a hypervisor running only a few VMs. With full virtualization (no CPU extensions), 5% overhead is pretty much a dream.

Keep in mind this can depend on a great many factors. Virtualization tends to add a significant amount of latency between the disks and the guest OS. This increases IO wait, and therefore load averages, while keeping CPU usage rather low. If your storage is on the lower side of the IOPS scale, this will have a very big impact. If you are using network storage, it will almost always add latency, since each IO has to cross a network instead of just an internal bus.
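
A crude way to see this latency difference directly is to time small synchronous writes on bare metal and inside the guest and compare. A minimal sketch (assumes Python on the machine under test; a purpose-built tool such as Iometer would give better numbers):

```python
# Crude per-operation storage latency probe: time small synchronous writes.
# Run the same script on bare metal and inside the guest and compare.
import os
import time

PATH = "latency_probe.tmp"          # hypothetical test file on the disk under test
N = 200

f = open(PATH, "wb")
start = time.time()
for _ in range(N):
    f.write(b"x" * 4096)            # 4KB write
    f.flush()
    os.fsync(f.fileno())            # force each write to stable storage
elapsed = time.time() - start
f.close()
os.remove(PATH)

print("avg synchronous 4KB write latency: %.2f ms" % (elapsed / N * 1000))
```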

Virtualization can also add extra network latency if you use special network configuration modules such as virtual switches, but this is usually not very significant.

Virtualization also tends to add many extra interrupts, which are required to switch from one VM to another. Depending on the hypervisor's scheduler, this can be significant. There isn't much you can do about it, since it is simply the nature of virtualization, but it is something to keep in mind as an explanation for lower performance.

Due to the single-threaded nature of your application, adding more cores will yield no significant performance improvement. Both CPUs have similar frequencies, but note that the X5650 has a lower clock speed unless Turbo Boost is active. You may want to check that that feature is supported and enabled in your setup.
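
A fixed single-threaded loop timed on both the physical box and the guest gives a quick yardstick for the effective per-core speed each one delivers, Turbo Boost included. A trivial sketch (assumes Python on both machines; only the ratio between the two results matters):

```python
# Fixed single-threaded workload; absolute time is meaningless, but the ratio
# between the physical box and the guest shows relative per-core speed.
import time

def busy_work(iterations=5000000):
    total = 0.0
    for i in range(1, iterations):
        total += (i % 7) * 0.5 - (i % 3)   # arbitrary integer/float mixing
    return total

start = time.time()
busy_work()
print("single-core run time: %.2f s" % (time.time() - start))
```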

A 33% overhead on an IO-intensive workload is, I find, not so bad. Try separating the storage for your two VMs and see if that helps.