Performance details of vSphere 4.1 features emerge: Scalable vMotion, Wide VM NUMA, Memory Compression, Storage I/O Control and others

After unveiling the list of features that will appear in the next major release of the VMware vSphere platform (currently numbered 4.1, but likely to change to 4.5 to align with the upcoming release of View 4.5), virtualization.info can now share full details about the performance improvements introduced by some of them, such as Scalable vMotion, Wide VM NUMA, Memory Compression and others.

Let’s start with the new configuration limits that vSphere 4.1 can reach:

  • 3,000 virtual machines per cluster (compared to 1,280 in vSphere 4.0)
  • 1,000 hosts per vCenter Server (compared to 300)
  • 15,000 registered VMs per vCenter Server (compared to 4,500)
  • 10,000 concurrently powered-on VMs per vCenter Server (compared to 3,000)
  • 120 concurrent Virtual Infrastructure Clients per vCenter Server (compared to 30)
  • 500 hosts per virtual Datacenter object (compared to 100)
  • 5,000 virtual machines per virtual Datacenter object (compared to 2,500)

The footprint and memory consumption of the hostd agent have been greatly reduced (memory is down by 40%), speeding up some operations by a factor of 3x.

Scalable vMotion
vSphere 4.1 supports up to 8 concurrent virtual machine live migrations, and VMware seems to have renamed the feature Scalable vMotion.
The engine has been significantly reworked to reach a throughput of 8Gbps on a 10GbE link, three times the throughput measured in version 4.0.

Wide VM NUMA
The vSphere 4.1 NUMA scheduler has been reworked to improve performance when a virtual machine needs more cores than a single NUMA node can offer, assuming that the server has a large number of NUMA nodes.
Depending on workloads and configurations, the performance improvement is up to 7%.
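
To illustrate the idea, here is a conceptual sketch (not the actual ESX scheduler) of how a "wide" VM, whose vCPU count exceeds the cores of a single NUMA node, can be split into smaller NUMA clients that each fit on one node; the even split shown below is an assumption made for the example:

```python
# Conceptual sketch only: splitting a "wide" VM into per-node NUMA clients so that
# each client's vCPUs (and ideally its memory) can stay local to a single NUMA node.

def split_wide_vm(vcpus: int, cores_per_node: int) -> list[int]:
    """Return the vCPU count of each NUMA client the VM would be split into."""
    if vcpus <= cores_per_node:
        return [vcpus]                      # not a wide VM: a single NUMA client
    clients = -(-vcpus // cores_per_node)   # ceiling division: nodes needed
    base, extra = divmod(vcpus, clients)    # spread vCPUs as evenly as possible
    return [base + 1] * extra + [base] * (clients - extra)

# Example: an 8-vCPU VM on a host with 4 cores per NUMA node
print(split_wide_vm(8, 4))   # -> [4, 4]: two NUMA clients, one per node
print(split_wide_vm(6, 4))   # -> [3, 3]: split evenly rather than 4 + 2
```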

Transparent Memory Compression
vSphere 4.1 introduces a new memory over-commit technique called Transparent Memory Compression (TMC) that compresses on the fly the virtual memory pages that would otherwise be swapped to disk.
Each virtual machine has a compression cache where vSphere stores pages that compress down to 2KB or less.

TMC is enabled by default on ESX/ESXi 4.1 hosts, but the administrator can adjust the compression cache limits or disable TMC completely.
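
As a rough illustration of the mechanism described above, here is a minimal sketch (using zlib, not the actual vmkernel code) of the compress-or-swap decision; only the 2KB threshold comes from the description, everything else is illustrative:

```python
import os
import zlib

PAGE_SIZE = 4096          # guest memory page size
COMPRESSED_LIMIT = 2048   # only pages that shrink to 2KB or less go into the cache

def reclaim_page(page: bytes, compression_cache: list) -> str:
    """Sketch of the TMC decision: keep the page in the compression cache if it
    compresses well enough, otherwise fall back to swapping it to disk."""
    compressed = zlib.compress(page)
    if len(compressed) <= COMPRESSED_LIMIT:
        compression_cache.append(compressed)   # stays in the VM's compression cache
        return "compressed"
    return "swapped"                            # would go to the swap file on disk

# Example: a zero-filled page compresses easily; random data usually does not.
cache = []
print(reclaim_page(bytes(PAGE_SIZE), cache))        # -> "compressed"
print(reclaim_page(os.urandom(PAGE_SIZE), cache))   # -> "swapped"
```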

This results in a performance gain of 15% when there’s a fair amount of memory over-commitment and a gain of 25% in case of heavy over-commitment.

[Image: vSphere41_MemoryCompression]

Storage I/O Control
vSphere 4.1 introduces the capability to define quality-of-service prioritization for the I/O activity of a single host or a cluster of hosts.
The prioritization, which can be enabled or disabled on specific datastores, is enforced through shares and limits.

The ESX/ESXi hosts monitor the latency in communication with the datastore of choice. As soon as that latency exceeds a defined threshold, the datastore is considered congested.
At that point, all VMs accessing that datastore are prioritized according to their defined shares.
The administrator can also define the maximum number of I/O operations per second (IOPS) that each virtual machine can issue.

Here’s an example:

[Image: vSphere41_StorageIOControl]

VMware reports an improvement of up to 36% in certain scenarios.
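
The following is a conceptual sketch of the throttling behaviour described above, not VMware's implementation; the VM names, shares values, queue depth and 30ms threshold are illustrative assumptions:

```python
# Conceptual sketch: once the datastore's observed latency crosses the congestion
# threshold, the device queue depth is divided among the VMs in proportion to their
# shares (per-VM IOPS limits, not modelled here, would further cap each VM).

CONGESTION_THRESHOLD_MS = 30   # hypothetical latency threshold

def allocate_queue_slots(vms: dict, queue_depth: int, latency_ms: float) -> dict:
    """vms maps a VM name to {'shares': int}; returns per-VM queue slots."""
    if latency_ms <= CONGESTION_THRESHOLD_MS:
        return {name: queue_depth for name in vms}   # no congestion: no throttling
    total_shares = sum(vm["shares"] for vm in vms.values())
    return {name: max(1, queue_depth * vm["shares"] // total_shares)
            for name, vm in vms.items()}

vms = {"db-vm": {"shares": 2000}, "web-vm": {"shares": 1000}, "test-vm": {"shares": 500}}
print(allocate_queue_slots(vms, queue_depth=64, latency_ms=45))
# -> {'db-vm': 36, 'web-vm': 18, 'test-vm': 9}: the high-shares VM keeps most of the queue
```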


Additional performance enhancements
vSphere 4.1 introduces additional improvements in other areas like Storage vMotion, thanks to support for 8Gb Fibre Channel HBAs.
VMware reports a 50% performance improvement (in terms of IOPS) compared to 4Gb FC HBAs and five times better throughput.

Support for NFS storage has been improved too, and it now features up to a 15% reduction in CPU cost for reads and writes, as well as up to a 15% improvement in throughput.

iSCSI support has been improved as well, with new support for iSCSI TCP Offload Engine (TOE) network interface cards.
VMware reports up to an 89% reduction in CPU cost for reads and an 83% reduction for writes.

vSphere 4.1 has additional new capabilities for its networking layer with the introduction of support for Large Receive Offload (LRO), which allows network packets larger than the MTU size to be received.
This is useful only for Linux guest operating systems that support LRO.
LRO support translates into a 5-30% improvement in throughput and a 40-60% decrease in CPU cost, depending on the workload.
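
As a rough illustration of why LRO cuts CPU cost (a conceptual sketch, not the vmkernel implementation; the 64KB aggregation limit is an assumption), consecutive segments of the same flow are coalesced into one large buffer before being handed to the guest:

```python
MTU = 1500        # standard Ethernet MTU
LRO_MAX = 65535   # assumed upper bound for an aggregated receive buffer

def coalesce(segments: list[bytes]) -> list[bytes]:
    """Merge consecutive MTU-sized segments into buffers of up to LRO_MAX bytes,
    so the guest processes a few large receives instead of many small packets."""
    merged, current = [], b""
    for seg in segments:
        if len(current) + len(seg) > LRO_MAX:
            merged.append(current)
            current = b""
        current += seg
    if current:
        merged.append(current)
    return merged

# Example: 100 MTU-sized segments collapse into 3 large receives instead of 100 small ones.
segments = [b"x" * MTU] * 100
print(len(coalesce(segments)))   # -> 3
```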

vSphere 4.1 also introduces asynchronous transmission of network packets through a new TX worldlets scheduler.
This translates into a 2x throughput improvement for VM-to-VM traffic on the same host and up to a 10% throughput improvement for VM-to-host traffic.

Last but not least, vSphere 4.1 also introduces better performance for VDI when used in conjunction with View.
The creation of new virtual desktops is now 60% faster, and they power on 3.4 times faster.