Guest star author: Kevin Lawton, Lead developer of Bochs.
Live VM migration, VMotion in VMware parlance, is a key technology underlying a number of useful features. For example, VMware’s DRS and DPM features use migration to perform load balancing and power management, respectively. These are in essence high-level forms of scheduling, though at a much coarser time granularity than the one at which an operating system schedules.
Given a migration within the same storage and networking domains, there is still a considerable amount of VM memory which has to be transferred between source and destination servers, through a finite amount of networking bandwidth. On a 1GbE network for example, a VM with 2GB of RAM might have a best-case migration time on the order of 20 seconds. Or on a 10GbE network, the same VM might have a best-case migration time on the order of 2 seconds. In some cases, live migration takes minutes to complete.
Using relatively slow VM migration as a mechanism for scheduling has a number of risks and shortcomings, which leave its full potential untapped.
This is necessarily true because workloads can ebb, flow, and spike much faster than the scheduler can respond by re-scheduling VMs on other servers. As a result, the scheduler has to be ultra-conservative; otherwise it may break SLAs and/or create troublesome load-based hot-spots. By contrast, if the scheduler could expect near-instantaneous VM migrations, it could perform much higher-fidelity load balancing or much more efficient power management (packing VMs onto the absolute fewest number of powered-on servers). Thus, as live VM migration times decrease, less conservatism is needed, and more potential performance and power savings can be wrung out of existing resources.
Accelerating VM migration
One of the keys to accelerating VM migration time is to make use of duplicate memory throughout the compute fabric.
As is the case within a given physical server, there is generally a considerable amount of duplicate memory contents. The more similar the VMs, the higher the percentage of duplication. But rather than look for intra-server memory duplication, I researched looking at memory contents across a whole fabric (cloud) of servers.
Each server becomes a member of a collective, which acts in concert to identify and properly mark memory duplication throughout the fabric. Of course, this requires some new infrastructure and a much broader set of techniques for memory duplication analysis, but it can potentially plug into existing virtualization hypervisors. A strong benefit is that by extending the scope of analysis to the entire chosen universe of servers, much more duplication can be found, even if any given physical server hosts a diverse set of VMs at any one time. By contrast, intra-server content sharing is limited in scope to the memory contents of the current VM workloads (less opportunity, and more constrained by VM diversity).
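To make the idea concrete, here is a minimal Python sketch of fabric-wide duplicate detection via page fingerprints. The names (FabricIndex, publish, present_on) are illustrative, not part of any existing hypervisor API, and a real implementation would distribute and shard the index itself rather than keep it in one dictionary.

```python
# Hypothetical sketch: fabric-wide duplicate detection via page fingerprints.
# Class and method names are illustrative, not an existing hypervisor API.
import hashlib

PAGE_SIZE = 4096

def fingerprint(page: bytes) -> str:
    """Content hash of a guest-physical page (a real system would byte-compare on hash collision)."""
    return hashlib.sha1(page).hexdigest()

class FabricIndex:
    """A (conceptually distributed) map: page fingerprint -> set of servers holding that content."""
    def __init__(self):
        self.locations = {}

    def publish(self, server: str, pages: list) -> None:
        for page in pages:
            self.locations.setdefault(fingerprint(page), set()).add(server)

    def present_on(self, server: str, page: bytes) -> bool:
        return server in self.locations.get(fingerprint(page), set())

# Usage: each server periodically publishes fingerprints of its resident pages;
# a migration source then asks which of the VM's pages already exist at the destination.
index = FabricIndex()
index.publish("host-B", [b"\x00" * PAGE_SIZE, b"kernel-code" + b"\x00" * (PAGE_SIZE - 11)])
vm_pages = [b"\x00" * PAGE_SIZE, b"app-heap" + b"\x01" * (PAGE_SIZE - 8)]
print([index.present_on("host-B", p) for p in vm_pages])  # [True, False]: only the second page must be sent
```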
With a distributed memory duplication network in-place, some really tangible benefits result.
The first is that memory which is recognized to exist on both the source and destination of a migration does not need to be transferred. It is simply copied from the existing memory contents on the destination (or referred to by a pointer). This eliminates both the networking transfer time associated with the duplicate state and the bandwidth it would have consumed.
So for example, if one could find 75% redundancy, only 25% of the memory contents plus some meta information about the elided contents need be transferred.
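A quick back-of-the-envelope check, assuming roughly 125 MB/s of usable 1GbE bandwidth and a small (assumed) per-page overhead for referencing the elided contents:

```python
# Back-of-the-envelope: first-pass transfer with and without 75% fabric-level duplication.
# Bandwidth and metadata-overhead figures are assumptions for illustration.
ram_mb = 2048              # VM memory
bandwidth_mb_s = 125       # usable 1GbE throughput, roughly
dup_fraction = 0.75        # memory already present at the destination
meta_overhead = 0.01       # small per-page references for the elided 75%

unique_mb = ram_mb * (1 - dup_fraction) + ram_mb * dup_fraction * meta_overhead
print(f"full copy : {ram_mb / bandwidth_mb_s:.1f}s")    # ~16.4s
print(f"with dedup: {unique_mb / bandwidth_mb_s:.1f}s")  # ~4.2s
```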
Now, this turns out to be a leveraged optimization due to the way live migration works. Generally, a pre-copy of memory state is sent in a 1st pass while the VM continues execution. Subsequent passes transfer only the deltas since the last pass, until a threshold (a small enough amount of remaining data) is reached, at which point the VM is stunned and the remaining deltas are transferred.
Any reduction in data transferred on the 1st pass is also handsomely rewarded by narrowing the window of time (and thus the amount of memory deltas) for the 2nd pass! A smaller 2nd pass leads to a smaller 3rd pass, and so on.
To make this more concrete, this VMware talk (slide 11) shows a nominal progression of passes for a 2GB VM (over a 1GbE network) of 16s, 4s, 1s, 0.25s. With the memory duplication network, my research shows this can be nominally reduced to 4s, 1s, 0.25s, or about 4x faster than current technology.
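The leverage across passes can be illustrated with a toy model of the pre-copy iteration. The dirty rate, bandwidth, and stop threshold below are assumptions chosen to roughly reproduce the slide’s shape, not measured values:

```python
# Toy model of iterative pre-copy: each pass retransmits the pages dirtied
# during the previous pass, until the remainder is small enough to stun the VM.
# Dirty rate, bandwidth and stop threshold are illustrative assumptions.

def precopy_passes(first_pass_mb, bandwidth_mb_s, dirty_mb_per_s, stop_mb=32, max_passes=10):
    passes, to_send = [], first_pass_mb
    for _ in range(max_passes):
        t = to_send / bandwidth_mb_s
        passes.append(t)
        if to_send <= stop_mb:
            break
        to_send = dirty_mb_per_s * t      # pages dirtied while this pass was running
    return passes

bw, dirty = 125.0, 30.0                              # 1GbE-ish bandwidth, assumed dirty rate
full = precopy_passes(2048, bw, dirty)               # no duplicate elision
dedup = precopy_passes(2048 * 0.25, bw, dirty)       # 75% of the first pass elided
print([round(t, 2) for t in full])   # [16.38, 3.93, 0.94, 0.23]
print([round(t, 2) for t in dedup])  # [4.1, 0.98, 0.24]
```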
Pushing the envelope
Optimizing by 4x would accelerate the ~20s migration case above to ~5s (1GbE) and the ~2s case to ~0.5s (10GbE)!
That alone is enough to make the time granularities more attractive for far more aggressive load and power management scheduling. But there are ways to optimize even further, with various trade-offs of power/compute/memory.
First, a number of techniques have been researched for optimizing memory sharing, such as using sub-page granularity (up to 65%) and a differencing engine (up to a phenomenal 90%). But there is a second technique which can be used independently of, or in conjunction with, those.
Given a distributed memory sharing network, even non-duplicate memory can be transferred (replicated) to other nodes in the network on speculation. As this replication is pure speculation, corresponding memory contents can be dropped at-will and used for more immediate needs. So it can use “spare” memory in the network of servers.
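One way to picture this droppable, best-effort state is as a speculative page store on each server. The sketch below (illustrative class and method names, not any existing hypervisor interface) evicts freely whenever memory is wanted for real workloads:

```python
# Sketch: speculative replication into spare memory, modeled as a droppable cache.
# A real hypervisor would track this per guest-physical page and per server.
from collections import OrderedDict

class SpeculativeStore:
    """Holds speculatively replicated pages; evicts freely when memory is needed elsewhere."""
    def __init__(self, capacity_pages: int):
        self.capacity = capacity_pages
        self.pages = OrderedDict()               # fingerprint -> page contents

    def replicate(self, fp: str, page: bytes) -> None:
        self.pages[fp] = page
        self.pages.move_to_end(fp)
        while len(self.pages) > self.capacity:
            self.pages.popitem(last=False)       # drop the oldest speculation; no correctness impact

    def reclaim(self, n_pages: int) -> None:
        """Give memory back to real workloads by discarding speculative copies."""
        for _ in range(min(n_pages, len(self.pages))):
            self.pages.popitem(last=False)

    def take(self, fp: str):
        return self.pages.get(fp)                # a hit means the page need not cross the network
```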
Of course, memory areas with a high rate of change are not necessarily good candidates for such replication. But I found in my research that it’s not hard to get to 80% to 90% effective sharing using this technique, and thus VM migration acceleration can reach 5x to 10x before employing any exotic memory sharing techniques.
Another benefit of this technology becomes more evident when doing longer-distance VM migrations over a more constrained networking pipe.
This use case is still evolving, as is evidenced by VMware’s keynote talk at VMworld. But let’s consider a VM migration between two distant data centers. Obviously, it’s good to saturate finite networking resources with less VM memory data, using the elision techniques.
What’s also interesting is that we don’t necessarily have to have the duplicate data between source and destination servers. We just need to have the duplicated data somewhere in the destination data center, preferably near the destination.
In that case, we can transfer the duplicated data in a short-haul fashion intra-data-center, and the unique data long-haul inter-data-center. And we can do it in parallel!
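A sketch of that split, where dest_dc_index, fetch_local and send_wan are placeholders for whatever index and copy mechanisms the hypervisor actually uses:

```python
# Sketch of the parallel transfer split: duplicate pages are pulled short-haul inside the
# destination data center while unique pages stream long-haul over the WAN.
from concurrent.futures import ThreadPoolExecutor

def migrate_split(pages, dest_dc_index, fetch_local, send_wan):
    local = [p for p in pages if dest_dc_index.has(p)]       # duplicated somewhere near the destination
    remote = [p for p in pages if not dest_dc_index.has(p)]  # unique: must cross the WAN
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_local = pool.submit(fetch_local, local)   # intra-data-center bandwidth
        f_wan = pool.submit(send_wan, remote)       # constrained inter-data-center pipe
        f_local.result()
        f_wan.result()
```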
Given a scalable memory sharing network infrastructure, the bigger the cloud of virtualization, the more of these kinds of optimization opportunities exist. And the more spare memory is available to speculate with, for unique memory replication.
Power
The 1st VM executed on a hypervisor costs the most money in terms of power, as there is a lot of overhead in powering up a server and the related chips. Adding more VMs (and thus requiring more MHz and power) costs incrementally less, as the initial power-on costs have already been paid. Allowing finer-grained scheduling has the advantage of allowing a greater average load (and VM density) on a physical server, and thus allows both better utilization and power efficiency of resources. I estimated from a rough model that utilization could be increased another 15+% if more rapid scheduling is used. That can translate to 15+% less hardware, and/or some non-trivial power savings.
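A crude consolidation model illustrates where that kind of saving comes from; the idle and peak wattages and the utilization caps below are assumptions for illustration, not measurements:

```python
# Crude model: server power is dominated by a fixed idle cost, so packing VMs onto
# fewer powered-on servers saves power. Wattages and utilization caps are assumed.
import math

def fleet_power(total_load_pct, max_util_pct, idle_w=200, peak_w=350):
    """total_load_pct: summed VM demand in percent-of-one-server; max_util_pct: scheduler headroom cap."""
    hosts = math.ceil(total_load_pct / max_util_pct)
    per_host_util = total_load_pct / hosts / 100
    return hosts, hosts * (idle_w + (peak_w - idle_w) * per_host_util)

print(fleet_power(600, max_util_pct=60))  # conservative scheduling -> (10 hosts, 2900.0 W)
print(fleet_power(600, max_util_pct=75))  # tighter packing via fast migration -> (8 hosts, 2500.0 W)
```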
Scheduling
I believe scheduling for load balancing and power management will continue evolving for some time. What’s nice about the techniques I describe herein is that a much richer amount of knowledge exists which can be fed into scheduling decisions. Since the sharing potential between any two VMs can be divined, inter-server scheduling can, for example, decide to place more similar VMs on the same physical server to get better density. And as has often been the case, a server runs out of RAM before it runs out of compute capacity. So better VM density is important for obtaining peak utilization.
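Here is a sketch of what similarity-aware placement might look like, given per-VM page fingerprint sets from the sharing network. The host and VM structures are illustrative, and the RAM accounting is a rough approximation:

```python
# Sketch: similarity-aware placement. Pick the host where the largest share of a new VM's
# pages is already resident, subject to a RAM cap (only unshared pages cost new RAM).

def best_host(vm_fps: set, hosts: dict, vm_mb: float, capacity_mb: float):
    """hosts maps name -> (set of resident page fingerprints, MB in use)."""
    best_name, best_shared = None, -1.0
    for name, (resident_fps, used_mb) in hosts.items():
        shared_frac = len(vm_fps & resident_fps) / max(len(vm_fps), 1)
        extra_mb = vm_mb * (1 - shared_frac)          # rough: shared pages consume no new RAM
        if used_mb + extra_mb <= capacity_mb and shared_frac > best_shared:
            best_name, best_shared = name, shared_frac
    return best_name

hosts = {
    "host-A": ({"k1", "k2", "lib1"}, 40_000),   # runs similar guests (shares kernel/library pages)
    "host-B": ({"db1", "db2"}, 20_000),
}
print(best_host({"k1", "k2", "app9"}, hosts, vm_mb=2048, capacity_mb=48_000))  # -> host-A
```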
Growing with the Cloud
This is a technology with a very immediate benefit to the existing use-case of virtualization, where a lot of VMs tend not to live-migrate outside of a particular physical location. But it scales nicely across multiple data center locations, and will grow with virtualization as it spans multiple physical locations, as per the VMworld keynote. In fact in the latter case, it will be absolutely critical to optimize VM migration times, and reduce the amount of network bandwidth consumed. And I believe the technology will couple nicely with future storage and networking continuity solutions.
About the author
Kevin Lawton is a pioneer in x86 virtualization, serial entrepreneur, business and technology visionary, prolific idea creator, news and business book junkie. Founding team member in a microprocessor startup, the author and lead for two Open Source projects, and a public speaker. He has a degree in computer science and started his career at MIT Lincoln Laboratory.
Contact him here. Note that the research is patent pending.