Intel Vanderpool holds promise, some pitfalls

Quoting from The Inquirer:

Intel introduced VT, or Vanderpool Technology, a few IDFs ago with great fanfare and little hard information. Since then, as the technology got closer and closer to release, a little more info has emerged, but even more questions. In the following four-part article, I will tell you a little about what VT is and what it does for you. The first part is about virtualisation: what it is and what it does, followed by the problems it has in the second part. The third chapter will be more on Vanderpool (VT) itself and how it works on a technical level. The closing chapter will be on the uses of VT in the real world, and most likely how you will see, or hopefully not see, it in action.
Virtualisation is a way to run multiple operating systems on the same machine at the same time. It is akin to multitasking, but where multitasking allows you to run multiple programs on one OS on one set of hardware, virtualisation allows multiple OSes on one set of hardware. This can be very useful for security and uptime purposes, but it comes at a cost.

Imagine an OS that you can load in nearly no time, and if it crashes, you can simply throw it out, and quickly load a new one. If you have several of these running at the same time, you can shut one down and shunt the work off to the other ones while you are loading a fresh image. If you have five copies of Redhat running Apache, and one goes belly up, no problem. Simply pass incoming requests to the other four while the fifth is reloading.

If you save ‘snapshots’ of a running OS, you can reload it every time something unpleasant happens. Get hacked? Reload the image from a clean state and patch it up, quick. Virused? Same thing. Virtualisation provides the ability to reinstall an OS on the fly without reimaging a hard drive like you would with Ghost. You can simply load, unload and save OSes like programs.
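The snapshot idea above can be sketched in a few lines of Python. This is a toy model only: the VM state here is just a dict, and all the names (ToyVM, save_snapshot, restore_snapshot) are invented for illustration, whereas a real VM snapshot captures RAM, CPU registers and device state.

```python
import copy

class ToyVM:
    """A toy 'virtual machine' whose entire state is a dict (hypothetical,
    for illustration only; real snapshots capture RAM, CPU and devices)."""
    def __init__(self):
        self.state = {"files": {"index.html": "hello"}, "patched": False}
        self.snapshot = None

    def save_snapshot(self):
        # Deep-copy so later corruption cannot reach the saved image.
        self.snapshot = copy.deepcopy(self.state)

    def restore_snapshot(self):
        self.state = copy.deepcopy(self.snapshot)

vm = ToyVM()
vm.save_snapshot()                           # known-clean state
vm.state["files"]["index.html"] = "0wned"    # the 'hack'
vm.restore_snapshot()                        # throw the compromised instance away
vm.state["patched"] = True                   # then patch the clean image
```

The deep copy is the whole trick: the saved image must be untouchable by anything the running instance does, which is exactly what makes "reload from a clean state" trustworthy.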

It also allows you to run multiple different OSes on the same box at the same time. If you are a developer that needs to write code that will run on 95, 98, ME, 2000 and XP, you can have five machines on your desk or one with five virtual OSes running. Need to have every version of every browser to check your code against, but MS won’t let you do something as blindingly obvious as downgrading IE? Just load the old image, or better yet, run them all at once.

Another great example would be a web hosting company. If you have 50 customers, each running a low-traffic web site, you can give them 50 boxes or one. 50 servers is the expensive way to go, very expensive, but also very secure. One is the sane way to go, that is, until one person wants Cold Fusion installed, but that conflicts with the custom mods of customer 17, and moron 32 has a script that takes them all down every Thursday morning at 3:28am. This triggers a big headache for tech support as they get hit with 50 calls when there should be one.

Virtualisation fixes this by giving each user what appears to be their own computer. For all they know they are on a box to themselves, no muss, no fuss. If they want plain vanilla SuSE, Redhat with custom mods, or a Cold Fusion setup that only they understand, no problem. That script that crashes the machine? It crashes an instance, and with any luck, it will be reloaded before the person even notices the server went down, even if they are up at 3:28am. No one else on the box even notices.

But not all is well in virtualisation land. The most obvious problem is that 50 copies of an OS on a computer take up more resources and lead to a more expensive server. That is true, and it is hard to get around under any circumstances: more things loaded take more memory.

The real killer is the overhead. There are several techniques for virtualisation, but they all come with a performance hit. This number varies wildly with CPU, OS, workload and number of OSes you are running, and I do mean wildly. Estimates I hear run from 10% CPU time to over 40%, so it really is a ‘depends’ situation. If you are near the 40% mark, you are probably second guessing the sanity of using a VM in the first place.

The idea behind VT is to lower the cost of doing this virtualisation while possibly adding a few bells and whistles to the whole concept. Before we dig into how this works, it helps if you know a little more about how virtualisation accomplishes the seemingly magic task of having multiple OSes running on one CPU.

There are three main types of virtualisation: paravirtualisation, binary translation and emulation. The one you may be familiar with is emulation: you can have a Super Nintendo emulator running in a window on XP, a PlayStation emulator in another, and a virtual Atari 2600 in a third. This can be considered the most basic form of virtualisation; as far as any game running is concerned, it is running on the original hardware. Emulation is really expensive in terms of CPU overhead: if you have to fake every bit of the hardware, it can take a lot of time and headaches. You simply have to jump through a lot of hoops, and do it perfectly.
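The core of any emulator is a fetch-decode-execute loop, and it shows why emulation is so expensive: every single guest instruction becomes many host operations. Here is a minimal sketch for an invented three-instruction machine; nothing here matches a real console's instruction set.

```python
# Toy fetch-decode-execute loop for an invented three-instruction machine.
# Each guest instruction costs a fetch, a decode (the if/elif chain) and an
# execute on the host, which is the root of emulation's CPU overhead.
def emulate(program):
    regs = {"A": 0}
    pc = 0
    while pc < len(program):
        op, arg = program[pc]          # fetch
        if op == "LOAD":               # decode + execute
            regs["A"] = arg
        elif op == "ADD":
            regs["A"] += arg
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {op}")
        pc += 1
    return regs["A"]

result = emulate([("LOAD", 2), ("ADD", 40), ("HALT", 0)])
```

Multiply that per-instruction cost by timers, video, sound and I/O that all have to be faked as well, and the overhead adds up quickly.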

The other end of the spectrum is the method currently in vogue, and endorsed by everyone under the sun, Sun included: Paravirtualisation (PV). PV is a hack, somewhat literally: it makes the hosted OSes aware that they are in a virtualised environment, and modifies them so they will play nice. The OSes need to be tweaked for this method, and there has to be a back and forth between the OS writers and the virtualisation people. In this regard, it isn't as much a complete virtualisation as it is a cooperative relationship.
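The shape of that cooperation can be sketched as follows: instead of executing a privileged instruction and hoping for the best, the modified guest calls the hypervisor explicitly. All the names here (Hypervisor, hypercall, guest_disable_interrupts) are invented for illustration and do not match any real PV interface such as Xen's.

```python
# Toy sketch of the paravirtual idea: the guest kernel has been modified
# so that privileged operations become explicit calls into the hypervisor,
# which applies them to the guest's *virtual* hardware, never the real thing.
class Hypervisor:
    def __init__(self):
        self.virtual_irq_enabled = {}   # per-guest virtual interrupt flag

    def hypercall(self, guest_id, op):
        if op == "disable_interrupts":
            self.virtual_irq_enabled[guest_id] = False
        elif op == "enable_interrupts":
            self.virtual_irq_enabled[guest_id] = True
        else:
            raise ValueError(f"unknown hypercall {op}")

hv = Hypervisor()

def guest_disable_interrupts(guest_id):
    # In a paravirtualised kernel, the privileged instruction has been
    # replaced at the source level with this explicit hypercall.
    hv.hypercall(guest_id, "disable_interrupts")

guest_disable_interrupts("guest0")
```

This is why the source access matters: someone has to go into the kernel and swap the privileged instructions for these calls, which is easy for Linux and impossible for a closed OS unless the vendor cooperates.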

PV works very well for open source OSes where you can tweak what you want in order to get them to play nice. Linux, xBSD and others are perfect PV candidates, Windows is not. This probably explains why RedHat and Novell were all touting Xen last week and MS was not on the list of cheerleaders.

The middle ground, and probably the best route in terms of tradeoffs, is Binary Translation (BT). What it does is look at what the guest OS is trying to do and change it on the fly. If the OS tries to execute instruction XYZ, and XYZ will cause problems for the virtualisation engine, BT changes XYZ to something more palatable and fakes the results XYZ should have returned. This is tricky work, and it can consume a lot of CPU time, both for the monitoring and for the fancy footwork required to keep it all from blowing up. Replacing one instruction with dozens of others is not a way to make things run faster.
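A toy version of that rewriting looks something like this. The instruction names are invented and the "instructions" are tidy strings; a real translator works on raw x86 bytes, basic block by basic block, and caches its translations.

```python
# Toy binary translator: scan a guest basic block and rewrite unsafe
# instructions into longer, safe sequences. One guest instruction can
# easily become several host instructions, hence the speed hit.
UNSAFE = {"WRITE_CR0"}   # instructions the VM must intercept (illustrative)

def translate(block):
    out = []
    for instr in block:
        if instr in UNSAFE:
            # Save state, call into the VM to emulate the effect,
            # then restore: dozens of instructions replacing one.
            out += ["SAVE_GUEST_STATE",
                    "CALL_VMM_EMULATE_" + instr,
                    "RESTORE_GUEST_STATE"]
        else:
            out.append(instr)        # safe instruction passes through as-is
    return out

translated = translate(["ADD", "WRITE_CR0", "RET"])
```

The pass-through case is the saving grace: most instructions are harmless and run at full speed, so the cost is concentrated on the awkward few.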

When you add in things like self modifying code, you get headaches that mere mortals should not have. None of the problems have a simple solution, and all involve tradeoffs to one extent or another. Very few problems in this area are solved, most are just worked around with the least amount of pain possible.

So, what are these problems? For the x86 architecture, at least in 32 bits, there are enough to drive you quite mad, but as a general rule they all involve ring transitions and related instructions. Conceptually, rings are a way to divide a system into privilege levels: you can have an OS running in a level that a user's program cannot modify. This way, if your program goes wild, it won't crash the system, and the OS can take control, shutting down the offending program cleanly. Rings enforce control over various parts of the system.
There are four rings in x86, numbered 0 through 3, with the lower numbers having higher privilege. A simple way to think about it is that a program running at a given ring cannot change things running at a lower-numbered ring, but something running at a lower-numbered ring can mess with a higher-numbered ring.
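That rule fits in one line of code. The function name and the idea of tagging resources with a ring number are illustrative only; on a real CPU the checks are wired into segment and page protection, not a lookup.

```python
# Toy model of the ring rule: lower number = more privilege. Code running
# at a given ring may touch a resource only if its ring number is less
# than or equal to the resource's ring number.
def can_access(task_ring, resource_ring):
    return task_ring <= resource_ring

allowed = can_access(0, 3)   # the OS (ring 0) reaching into user space (ring 3)
blocked = can_access(3, 0)   # a user program (ring 3) poking at the OS (ring 0)
```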

In practice, only rings 0 and 3, the most and least privileged, are commonly used. OSes typically run in ring 0 while user programs sit in ring 3. One of the ways the 64-bit extensions to x86 'clean up' the ISA is by losing the middle rings, 1 and 2. Pretty much no one cared that they were gone, except the virtualisation folk.

Virtual machines (VMs) like VMware obviously have to run in ring 0, but if they want to maintain complete control, they need to keep the hosted OS out of ring 0. If a runaway task can overwrite the VM, it kind of negates half the reason you want it in the first place. The obvious solution is to force the hosted OS to run in a less privileged ring, like ring 1.

This would be all fine and dandy except that the OSes are used to running in ring 0, and having complete control of the system. They are set to go from 0 to 3, not 1 to 3. In a PV environment, you change the OS so it plays nice. If you are going for the complete solution, you have to force it into ring 1.

The problem here is that some instructions will only work if they are going to or from ring 0, and others will behave oddly if not in the right ring. I mean 'oddly' in the computer way, i.e. as a euphemism for 'really, really bad things will happen if you try this; wear a helmet'. Deprivileging does prevent the hosted OS from trashing the VM, and also prevents the programs on the hosted OS from trashing the OS itself, or worse yet, the VM. This is the '0/1/3' model.

The other model is called the '0/3' model. It puts the VM in ring 0 and both the OS and programs in ring 3, but otherwise works essentially like the 0/1/3 model. The deprivileged OS in ring 3 can be walked on by user programs with much greater ease, but since there are no ring transitions, an expensive operation, it can run a little faster. Speed traded for security.

Another slightly saner way to do 0/3 is to have the CPU maintain two sets of page tables for things running in ring 0. One set of page tables would be for the OS, the other set for the old ring 3 programs. This way you have a fairly robust set of memory protections to keep user programs out of OS space, and the other way around. Of course this will once again cost performance, just in a different way, resulting in the other end of the speed vs security tradeoff.
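The two-page-table idea can be sketched like this: the same page number resolves through a different table depending on who is running, so a user program simply has no mapping for OS pages. This is purely illustrative; real page tables are multi-level hardware structures, and the names here are invented.

```python
# Toy sketch of dual page tables: one table for the OS's view of memory,
# one for user programs. A user program touching an OS page has no mapping
# at all, so the access faults instead of succeeding.
os_table   = {0: "os-code", 1: "os-data", 8: "shared"}
user_table = {8: "shared", 9: "user-heap"}

def lookup(page, running_as):
    table = os_table if running_as == "os" else user_table
    if page not in table:
        raise MemoryError(f"page fault: {running_as} has no mapping for page {page}")
    return table[page]

os_view = lookup(0, "os")          # the OS can see its own code
try:
    user_view = lookup(0, "user")  # a user program trying the same page
except MemoryError:
    user_view = "faulted"
```

The cost the article mentions comes from maintaining and switching between those two tables on every transition, which is slower than doing nothing but safer than sharing one table.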

To sum it all up: in 0/1/3, you have security, but take a hit when changing from 3 to 1, 3 to 0, or 1 to 0, and back again. In 0/3, you only have the 0 to 3 transition, so it can potentially run faster than the 0/1/3 model. If something goes wrong, though, the 0/3 model is much more likely to come down around your ears in a blue screen than the 0/1/3 model. The future is 0/3 regardless, mainly because, as I said earlier, the 64-bit extensions do away with rings 1 and 2, so you are forced into the 0/3 model. That is, in computer terms, called progress, much in the same way devastating crashes are considered 'odd'.

On paper this seems like the perfect thing to do, if you can live with a little more instability, or in the case of 0/1/3, a little speed loss. There are drawbacks though, and they broadly fall into four categories of ‘odd’. The first is instructions that check the ring they are in, followed by instructions that do not save the CPU state correctly when in the wrong ring. The last two are dead opposites, instructions that do not cause a fault when they should, and others that fault when they should not, and fault a lot. None of these make a VM writer’s life easier, nor do they speed anything up.

The first one is the most obvious: an instruction that checks the ring it is in. If you deprivilege an OS to ring 1 and it checks where it is running, it will get back 1, not 0. If a program expects to be in ring 0, it will probably take the 1 as an error, and a severe one at that. This leads to the user seeing blue, a core dump, or another form of sub-optimal user experience. Binary Translation can catch this and fake a 0, but that means tens or hundreds of instructions in the place of one, and the obvious speed hit.
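A toy version of that fakery, with invented names throughout (ACTUAL_GUEST_RING, guest_reads_ring), might look like this; a real translator patches the instruction stream rather than calling a function, but the lie is the same.

```python
# Toy version of hiding deprivileging: the guest really runs in ring 1,
# but any instruction that asks gets the answer it expects. The cost is
# the extra translated code executed on every such check.
ACTUAL_GUEST_RING = 1   # where the VM actually put the guest OS

def guest_reads_ring(translated=True):
    if translated:
        # Binary translation replaced the ring check with a faked constant.
        return 0
    # The unvarnished truth, which the guest would treat as a severe error.
    return ACTUAL_GUEST_RING
```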

Saving state is a potentially worse problem. Some things in a CPU are not easily saved on a context switch. A good example is the 'hidden' part of the segment registers: once a segment descriptor is loaded into the CPU, portions of that hidden state cannot be read back out, leading to unexpected discontinuities between the descriptor tables in memory and the actual values in the CPU. There are workarounds of course, but they are tricky and expensive performance-wise.

Instructions that do not fault when they should pose an obvious problem. If you are expecting an instruction to cause a fault that you later trap, and it doesn’t, hilarity ensues. Hilarity for the person writing code in the cubicle next to you at the very least. If you are trying to make things work under these conditions, it isn’t all that much fun.

The opposite case is things that fault when they should not, and these tend to be very common. Writes to CR0 and CR4 will fault if not running in the correct ring, leading to crashes or, if you are really lucky, lots and lots of overhead. Both of these fault problems, the spurious traps and the missing ones, are eminently correctable on the fly, but they cost performance, lots of performance.
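The standard trick for those spurious faults is trap-and-emulate: the guest's write faults because it is no longer in ring 0, the VM catches the fault, and applies the write to a shadow copy of the register instead. A toy sketch, with invented names (VMM, on_fault, shadow_cr0) standing in for the real machinery:

```python
# Toy trap-and-emulate loop for control-register writes. The guest's real
# MOV to CR0 faults outside ring 0; the VM fields the fault and updates a
# per-guest *shadow* CR0 so the guest believes the write succeeded.
class VMM:
    def __init__(self):
        self.shadow_cr0 = 0   # the guest's private, virtual CR0

    def on_fault(self, instr, value):
        # In reality every trapped write costs a full fault round trip,
        # which is where the 'lots and lots of overhead' comes from.
        if instr == "MOV_CR0":
            self.shadow_cr0 = value
        else:
            raise RuntimeError(f"unhandled faulting instruction {instr}")

vmm = VMM()
vmm.on_fault("MOV_CR0", 0x80000001)   # guest setting paging + protected mode bits
```

The emulation itself is easy; it is the sheer frequency of these traps that turns a clean mechanism into a performance problem.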

What the entire art of virtualisation comes down to is moving the OS to a place where it should not be, and running around like your head is on fire trying to fix all the problems that come up. There are a lot of problems, and they happen quite often, so the performance loss is nothing specific, but more of a death by a thousand cuts.