Paper: The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines

In May, VMware published a very interesting paper describing the design of its Fault Tolerance (FT) feature, first announced in late 2007 and shipped with vSphere 4.0 in June 2009.

The document, titled The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines, also describes the alternative designs that VMware explored before settling on the current one:

We have implemented a commercial enterprise-grade system for providing fault-tolerant virtual machines, based on the approach of replicating the execution of a primary virtual machine (VM) via a backup virtual machine on another server. We have designed a complete system in VMware vSphere 4.0 that is easy to use, runs on commodity servers, and typically reduces performance of real applications by less than 10%. Our method for replicating VM execution is similar to that described in Bressoud, but we have made a number of significant design changes that greatly improve performance. In addition, an easy-to-use, commercial system that automatically restores redundancy after failure requires many additional components beyond replicated VM execution. We have designed and implemented these extra components and addressed many practical issues encountered in supporting VMs running enterprise applications. In this paper, we describe our basic design, discuss alternate design choices and a number of the implementation details, and provide an evaluation of our performance for both micro-benchmarks and real applications.

Thanks to Yellow Bricks for the news.