This year’s Linux Symposium, taking place in Ottawa from June 19th to 22nd, features several very interesting sessions on virtualization, with speakers from the Xen project (which will introduce version 3.1), VMware, IBM, Intel, and others:
- Transparent Paravirtualization for Linux
Paravirtualization has a lot of promise, in particular in its ability to deliver performance by allowing the hypervisor to be aware of the idioms in the operating system. Since Linux kernel changes are necessary, it is very easy to get into a situation where the paravirtualized kernel is incapable of executing on a native machine, or on another hypervisor. It is also quite easy to expose too many hypervisor metaphors in the name of performance, which can impede the general development of the kernel with many hypervisor-specific subtleties.
VMI, or the Virtual Machine Interface, is a clearly defined, extensible specification for OS communication with the hypervisor. VMI delivers great performance, but doesn’t require that Linux kernel developers be aware of metaphors that are only relevant to the hypervisor. There is a clear distinction between the resource namespaces available to the virtual machine and to the hypervisor. As a result, VMI can keep pace with the fast release cycle of the Linux kernel: a new kernel version can be trivially paravirtualized. With VMI, a single Linux kernel binary can run on a native machine and on one or more hypervisors. In this way, VMI naturally promotes hypervisor diversity.
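The heart of this approach is that the kernel never issues privileged operations directly; it calls through a well-defined table of entry points that can be bound either to native code or to a hypervisor back end at run time. The sketch below illustrates that pattern only; the hook names are invented for the example and are not the actual VMI entry points.

```c
/*
 * Illustrative sketch only: a table of indirect calls standing in for the
 * kind of hypervisor interface VMI defines.  The names below are invented
 * for the example and are not the actual VMI calls.
 */
#include <stdio.h>

struct hv_ops {
    unsigned long (*read_cr3)(void);
    void (*write_cr3)(unsigned long val);
    void (*cpu_halt)(void);
};

/* "Native" backend: on bare metal the hooks would execute the raw instructions. */
static unsigned long native_read_cr3(void)      { /* mov %cr3, %rax */ return 0; }
static void native_write_cr3(unsigned long val) { /* mov %rax, %cr3 */ (void)val; }
static void native_halt(void)                   { /* hlt */ }

/* A hypervisor backend would rebind the hooks to calls into the VMI layer. */
static struct hv_ops hv_ops = {
    .read_cr3  = native_read_cr3,
    .write_cr3 = native_write_cr3,
    .cpu_halt  = native_halt,
};

int main(void)
{
    /* Kernel code calls through the table; the same binary runs on any backend. */
    hv_ops.write_cr3(hv_ops.read_cr3());
    printf("all privileged operations went through the hook table\n");
    return 0;
}
```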
We provide a working patch against the latest Linux revision, along with performance data (a) on native hardware, to show the negligible cost of VMI, and (b) on the VMware hypervisor, to show its benefits. We will also share some directions for future work.
- Utilizing IOMMUs for Virtualization in Linux and Xen
IOMMUs are hardware devices that translate device DMA addresses to proper machine physical addresses. IOMMUs have long been used for RAS (prohibiting devices from DMA’ing into the wrong memory) and for performance optimization (avoiding bounce buffers and simplifying scatter/gather). With the increasing emphasis on virtualization, IOMMUs from IBM, Intel, and AMD are being used and re-designed in new ways, e.g., to enforce isolation between multiple operating systems with direct device access. These new IOMMUs and their usage scenarios have a profound impact on some OS and hypervisor abstractions and their implementations.
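From a driver’s point of view, the translation an IOMMU performs stays hidden behind the kernel’s DMA mapping API: the driver asks for a device-visible address and the DMA layer (and the IOMMU beneath it) supplies one. Below is a hedged sketch of that usage, written against today’s DMA mapping API, with a hypothetical device and buffer:

```c
/*
 * Sketch only: how a driver obtains a device-visible DMA address through the
 * DMA mapping API, leaving the IOMMU programming to the kernel/hypervisor.
 * 'dev', 'buf', and 'len' are assumed to come from the surrounding driver.
 */
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int example_map_and_transfer(struct device *dev, void *buf, size_t len)
{
	dma_addr_t bus_addr;

	/* The address the device must use; it need not equal the buffer's
	 * physical address when an IOMMU sits in between. */
	bus_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, bus_addr))
		return -EIO;

	/* ... program the device with bus_addr and start the DMA ... */

	dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
	return 0;
}
```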
We describe the issues and design alternatives of kernel and hypervisor support for new IOMMU designs. We present the design and implementation of the changes made to Linux (some of which have already been merged into the mainline kernel) and Xen, as well as our proposed roadmap. We discuss how the interfaces and implementation can adapt to upcoming IOMMU designs, and how performance can be tuned for different workload, reliability, and security scenarios. We conclude with a description of some of the key research and development challenges that new IOMMUs present.
- X86-64 XenLinux: Architecture, Implementation, and Optimizations
Xen 3.0 has been officially released with x86-64 support added. In this paper, we discuss the architecture, design decisions, and various challenging issues we needed to solve when we para-virtualized x86-64 Linux.
Although we reused the para-virtualization techniques and code employed by x86 XenLinux as much as possible, there are notable differences between x86 XenLinux and x86-64 XenLinux. Because of the limited segmentation support on x86-64, for example, we needed to run both the guest kernel and applications in ring 3, raising the problem of protecting one from the other. This also complicated system call handling and event handling, including exceptions such as page faults and interrupts. For example, native device drivers run in ring 3 in x86-64 XenLinux today.
Xen itself had to be extended to support x86-64 XenLinux. To handle transitions between kernel and user mode securely, for example, Xen tracks the mode of each guest and controls the page tables used for each mode. We also discuss other extensions to x86 XenLinux in support of x86-64, including page table management, 4-level writable page tables, shadow page tables for live migration, new hypercalls, DMA, and IA-32 binary support.
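Page-table management is a concrete instance of the kind of extension discussed here: with writable page tables, a paravirtualized guest does not store into a live page table directly but asks Xen to validate and apply the update. The fragment below is only a hedged sketch of that pattern, assuming the Xen public interface headers and the guest-side hypercall wrapper; the field and constant names follow the Xen public headers as best recalled, not the paper’s actual code.

```c
/*
 * Sketch: applying a PTE update through Xen instead of a direct write.
 * Assumes the Xen public interface headers and the guest's hypercall
 * wrappers; the caller supplies the PTE's machine address and new value.
 */
#include <xen/interface/xen.h>     /* struct mmu_update, DOMID_SELF */
#include <asm/xen/hypercall.h>     /* HYPERVISOR_mmu_update() */

static int set_guest_pte(uint64_t pte_machine_addr, uint64_t new_pte_val)
{
	struct mmu_update req = {
		.ptr = pte_machine_addr,   /* machine address of the PTE */
		.val = new_pte_val,        /* new PTE contents */
	};

	/* Xen validates the request, so a guest cannot map memory it does not own. */
	return HYPERVISOR_mmu_update(&req, 1, NULL, DOMID_SELF);
}
```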
The current x86-64 XenLinux offers compelling performance for practical use. We compare the performance of native x86-64 Linux and XenLinux, and analyze the causes of the visible performance regressions. We also discuss performance optimizations, especially how to overcome the overhead caused by transitions between user and kernel mode, and present optimization experiments.
Finally, we discuss how the patches for x86-64 XenLinux can be merged upstream, and we present our efforts in that direction. We summarize the code shared with x86 XenLinux and the changes made to native x86-64 Linux.
- Linux as a Hypervisor – An Update
Through its history, the Linux kernel has had increasing demands placed on it as it supported new applications and new workloads. A relatively new demand is to act as a hypervisor, as virtualization has become increasingly popular. In the past, there were many weaknesses in the ability of Linux to be a hypervisor. Today, there are noticeably fewer, but they still exist.
Not all virtualization technologies stress the capabilities of the kernel in new ways. User-mode Linux (UML) is the only prominent example of a virtualization technology which uses the capabilities of a stock Linux kernel. As such, UML has been the main impetus for improving the ability of Linux to be a hypervisor. A number of new capabilities have resulted in part from this, some of which have been merged and some of which haven’t. Many of these capabilities have utility beyond virtualization, as they have also been pushed by people interested in applications unrelated to virtualization.

An early problem was the inability of ptrace on Linux/i386 to nullify intercepted system calls. This was fixed very early, as it is essential for virtualizing system calls. Another ptrace weakness was its requirement that both system call entry and exit be intercepted. A ptrace extension, PTRACE_SYSEMU, addresses this: it causes only system call entries to be intercepted, yielding a noticeable performance improvement in UML, even on workloads that aren’t dominated by system calls.

UML wasn’t one of the main drivers behind AIO and O_DIRECT, but it benefits from them. They allow UML to behave more like the host kernel by keeping multiple I/O requests outstanding, and to be more fully in charge of its own memory by bypassing the host’s caching. Another I/O improvement that helps the kernel’s virtualization capabilities is the ability to punch a hole in a file. Proposals for a sys_punch system call had circulated for years; madvise(MADV_REMOVE), which was the first such interface to be merged, removes a range of pages from a tmpfs file. This allows Linux to support memory hotplug in its guests.

FUSE (Filesystem in Userspace) is another recent addition of interest. It doesn’t contribute to the ability to host virtual machines, but it does contribute to the ability to manage them: UML uses FUSE to export its filesystem to the host, allowing some guest system management to be done from the host.

A number of other capabilities have not been merged. The large number of virtual memory areas (VMAs) that UML creates on the host is a noticeable performance problem. Ingo Molnar implemented a new system call, remap_file_pages, to fix this; it allows pages within a mapping to be rearranged, reducing the number of VMAs for a UML process from nearly one per mapped page to one. PTRACE_SYSEMU notwithstanding, system call interception is still a performance problem. Ingo has another patch, VCPU (Virtual CPU), which improves this: in effect, it allows a process to trace itself, eliminating the context switching that ptrace currently requires.
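Of the host interfaces just mentioned, hole punching is the easiest to show in isolation. Below is a minimal sketch, with an invented tmpfs path and sizes, of releasing a range of pages from a shared mapping with madvise(MADV_REMOVE), roughly the mechanism a UML host would rely on when a guest gives memory back:

```c
/*
 * Sketch: punching a hole in a tmpfs-backed file with madvise(MADV_REMOVE),
 * the mechanism described above for returning guest memory to the host.
 * The path and sizes are invented for the example.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 16 * 4096;            /* 16 pages of "guest memory" */
    int fd = open("/dev/shm/uml-guest-mem", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, len) < 0) { perror("setup"); return 1; }

    char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }

    memset(mem, 0xaa, len);                  /* touch the pages so they are backed */

    /* Release the middle 8 pages; tmpfs frees their backing store. */
    if (madvise(mem + 4 * 4096, 8 * 4096, MADV_REMOVE) < 0)
        perror("madvise(MADV_REMOVE)");

    munmap(mem, len);
    close(fd);
    return 0;
}
```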
- Evolution in the Kernel Debugging Utilizing Hardware Virtualization With Xen
Xen’s ability to run unmodified guests using the virtualization support available in hardware opens new possibilities for kernel debugging. It is now possible to debug the Linux kernel much the way you would debug a user process on Linux. Since the virtualization hardware lets Xen implement full virtualization, there is no need to change the kernel in any way to debug it. For example, you can boot a disk installed with any standard Linux distribution inside a fully virtualized Xen guest, start gdbserver-xen, and connect gdb to it remotely. When you get into gdb, the virtual machine is paused at the reset vector. You can then use gdb to debug the kernel much as you would debug a user process. If you like, you can start by debugging the BIOS code, then move on to debugging the boot loader and finally the kernel. You can poke into the state of the virtual machine using the standard gdb commands to access memory locations or registers. If you supply symbols to gdb, you can also use function and variable names. This new debugging technique has a few advantages over kdb: there is no need to modify the kernel you are trying to debug, and if the kernel misbehaves, the debugger is not taken down with it, because the debugger lives outside the kernel’s address space. This paper demonstrates the new debugging techniques with examples and explains how they actually work.
- Xen 3.1 and the Art of Virtualization
Xen 3 was released in December 2005, bringing new features such as support for SMP guest operating systems, PAE and x86_64, initial support for IA64, and support for CPU hardware virtualization extensions (VT/SVM). In this paper we provide a status update on Xen, reflecting on the evolution of Xen so far, and look towards the future. We will show how Xen’s VT/SVM support has been unified and describe plans to optimize our support for unmodified operating systems. We discuss how a single ‘xenified’ kernel can run on bare metal as well as over Xen. We report on improvements made to the Itanium support and on the status of the ongoing PowerPC port. Finally, we conclude with a discussion of the Xen roadmap.
- Virtual Scalability: Charting the Performance of Linux in a Virtual World
Many past topics at the Ottawa Linux Symposium have covered Linux scalability. While still quite valid, most of them have left out a hot feature in computing: virtualization. Virtualization adds a layer of resource isolation and control that allows many virtual systems to co-exist on the same physical machine. However, this layer also adds overhead, which can be very light or very heavy. We will use the Xen hypervisor, Linux 2.6 kernels, and many freely available workloads to accurately quantify the scaling and overhead of the hypervisor. Areas covered will include: (1) SMP scaling: using several workloads on a large SMP system to quantify performance with a hypervisor; (2) performance tools: how resource monitoring, statistical profiling, and tracing tools work differently in a virtualized environment; (3) NUMA: how Xen can best make use of large systems which have non-uniform memory access.
- HTTP-FUSE Xenoppix
We developed HTTP-FUSE Xenoppix, which boots Linux, Plan 9, and NetBSD on the Xen virtual machine monitor from a small (6.5 MB) bootable CD-ROM. The CD-ROM includes only a boot loader, a kernel, and a miniroot; most of the files are obtained over the Internet via the network loopback block device HTTP-FUSE CLOOP. HTTP-FUSE CLOOP is built from cloop (the compressed loopback block device) and FUSE (Filesystem in Userspace). It reconstructs a block device from many small split-and-compressed block files served from HTTP servers.
The name of each block file is the MD5 hash of its contents. Block regions with identical contents therefore map to a single hash-named file, which reduces total storage space. The block files are cached on local storage and can be reused; if the necessary block files already exist locally, the driver does not require a network connection. The file name also serves to verify the file’s contents, which helps security. When contents are updated, a new file is created with a new hash-value name. Old block files do not need to be deleted and can be used to roll back the file system.
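A hedged sketch of that naming scheme, assuming OpenSSL’s MD5 and zlib (the block size, directory layout, and helper name are invented for the example, and hashing the compressed rather than the raw block is an assumption here): compress each block and store it under the hex MD5 digest of the stored bytes, so identical blocks collapse into a single file.

```c
/*
 * Sketch of a content-addressed block store in the spirit of HTTP-FUSE CLOOP:
 * compress a block and save it under the hex MD5 digest of the stored bytes.
 * Directory layout and hashing of compressed (rather than raw) data are
 * assumptions for this example.  Link against -lcrypto -lz.
 */
#include <openssl/md5.h>
#include <zlib.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Compress one block and write it to <dir>/<md5-of-compressed-bytes>. */
int store_block(const char *dir, const unsigned char *block, size_t len)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    uLongf clen = compressBound(len);
    unsigned char *cbuf = malloc(clen);
    char path[4096];
    size_t off;
    FILE *f;

    if (!cbuf || compress(cbuf, &clen, block, len) != Z_OK) {
        free(cbuf);
        return -1;
    }

    MD5(cbuf, clen, digest);                 /* the file name is the content hash */

    off = snprintf(path, sizeof(path), "%s/", dir);
    for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
        off += snprintf(path + off, sizeof(path) - off, "%02x", digest[i]);

    f = fopen(path, "wb");                   /* identical blocks reuse one file */
    if (f) {
        fwrite(cbuf, 1, clen, f);
        fclose(f);
    }
    free(cbuf);
    return f ? 0 : -1;
}
```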
The performance of HTTP-FUSE CLOOP is sensitive to network latency, so we added two boot options to HTTP-FUSE Xenoppix: Download-ahead and Netselect. Download-ahead downloads and caches the necessary block files before the HTTP-FUSE CLOOP driver needs them at boot time. Netselect finds the best (lowest-latency) download site among the candidates. In this paper we report the boot performance of each OS under HTTP-FUSE Xenoppix.
The next version will include CPU virtualization technology, enabling more OSes to boot without kernel modification. We also plan to include trusted boot backed by a TPM, because HTTP-FUSE Xenoppix aspires to become a trial environment of OSes for anonymous users.
The complete list of session presentations is available in volume 1 and volume 2.