How Microsoft and VMware use virtualization internally

Who better than a virtualization vendor to showcase a successful case study and convince prospects to buy?

In May 2008, Microsoft published some details about how it uses Hyper-V to serve the MSDN and TechNet IIS 7 web front-ends.

Neither VMware nor Citrix nor any other vendor has ever published any information about their in-house implementations.
Anyway, juicy additional details recently emerged about both the Microsoft and the VMware data centers.


How Microsoft is really using Hyper-V

The MSDN and TechNet case studies were interesting but lacked many details. A new document published in January 2009 in the TechNet library now tells a much clearer (and in some cases concerning) story:

As early as September 2004, Microsoft IT calculated that the average CPU utilization for servers in data centers and managed lab environments was less than 10 percent, and continuing to decrease.

The virtualization goals are set very high for Microsoft IT, which has deployed more than 3,500 virtual machines. By June 2009, Microsoft IT plans to have 50 percent of all server instances running on virtual machines. With Windows Server 2008 Hyper-V, the expectation is that at least 80 percent of new server orders will be deployed as virtual machines.

As Microsoft IT developed standards for which physical machines to virtualize, it identified many lab and development servers with very low utilization and availability requirements. Because of the lower expectations, Microsoft IT now is deploying the lab and development virtual servers with four processor sockets, 16 to 24 processor cores, and up to 64 gigabytes (GB) of random access memory (RAM). These servers can host a large number of virtual machines, averaging 10.4 virtual machines per host machine.

As Microsoft IT developed its expertise in deploying virtual machines, and especially with the performance improvements available with Windows Server 2008 Hyper-V, it has increasingly moved toward virtualization of production servers. Although many production servers still have low utilization, some have significantly higher performance requirements than the lab and development computers. For the production-server deployments, Microsoft IT is using servers with two processor sockets, 8 to 12 processor cores, and 32 GB of RAM.

On average, the host servers with eight processors and 32 GB of RAM are hosting 5.7 virtual machines in the production environment.
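
To put those density figures into perspective, here is a quick back-of-the-envelope sketch in Python that derives the average RAM and CPU-core share per virtual machine from the host specifications quoted above; the per-VM shares are our own estimates (taking the lab hosts at their largest quoted configuration), not numbers Microsoft published.

    # Back-of-the-envelope math on the two host profiles described above.
    # Host specs and average VM counts come from the TechNet document; the
    # per-VM shares are derived estimates, not published figures.

    host_profiles = {
        "lab/dev":    {"cores": 24, "ram_gb": 64, "avg_vms": 10.4},
        "production": {"cores": 8,  "ram_gb": 32, "avg_vms": 5.7},
    }

    for name, p in host_profiles.items():
        ram_per_vm = p["ram_gb"] / p["avg_vms"]
        cores_per_vm = p["cores"] / p["avg_vms"]
        print(f"{name:11}: ~{ram_per_vm:.1f} GB RAM, ~{cores_per_vm:.1f} cores per VM")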

Microsoft IT configures all virtual machine hosts to use a SAN to store the virtual machine configuration and hard disk files. The host computers connect to the SAN by using dual-path Fibre Channel host bus adapters (HBAs). For production virtual servers, the SAN storage uses redundant array of independent disks (RAID) 0+1, whereas RAID 5 is used for lab and development virtual machine storage. Microsoft IT has chosen the RAID 0+1 configuration for the production servers because it provides better performance, but it does consume more disks. Performance is not as critical in the lab environment, so Microsoft IT uses RAID 5 because it uses fewer disks to store the virtual machines.
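
The disk-consumption trade-off is easy to quantify: RAID 0+1 mirrors everything, so only half of the raw capacity is usable, while RAID 5 gives up just one disk's worth of capacity to parity. A minimal sketch, assuming a hypothetical group of eight 300 GB disks (the disk count and size are ours, not from the document):

    # Hypothetical RAID group of eight 300 GB disks, used only to illustrate
    # why RAID 0+1 "consumes more disks" than RAID 5 for the same usable space.

    disks, disk_gb = 8, 300
    raw_gb = disks * disk_gb

    raid01_usable = raw_gb / 2            # mirrored stripes: half the raw capacity
    raid5_usable = (disks - 1) * disk_gb  # one disk's worth of capacity goes to parity

    print(f"RAID 0+1: {raid01_usable:.0f} GB usable out of {raw_gb} GB raw")
    print(f"RAID 5:   {raid5_usable:.0f} GB usable out of {raw_gb} GB raw")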

When Microsoft IT first deployed server virtualization, the goal was to use a shared storage model for the virtual machines. During the first iteration, Microsoft IT would create one or two large logical unit numbers (LUNs) on the SAN (100-plus GB) for each host computer and then deploy multiple virtual machines per LUN. In a typical scenario, Microsoft IT gave the customer a 50-GB drive C and a 20-GB drive D. Because both drives were dynamic virtual disks, the actual space used on the LUN was much less than the maximum size.

However, over time, the dynamic disks grew as the customers stored data on the virtual servers, and just two or three virtual machines could fill an entire LUN. This became a significant management issue for Microsoft IT, which had to track all LUNs for space availability and then move virtual machines before all space was utilized.
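
The management problem is essentially one of overcommitment: every virtual machine is promised 70 GB of dynamically expanding disks (the 50 GB drive C plus the 20 GB drive D), but a shared LUN only holds what the guests have actually written so far. A minimal sketch of the kind of tracking Microsoft IT had to do, with hypothetical LUN size, usage figures and alert threshold:

    # Tracking a shared LUN that hosts several dynamically expanding VHDs.
    # The 120 GB LUN size, the usage figures and the 80 % threshold are
    # hypothetical; only the 50 + 20 GB per-VM layout comes from the article.

    LUN_CAPACITY_GB = 120
    PROMISED_PER_VM_GB = 50 + 20   # maximum size of drive C plus drive D

    vms_actual_gb = {"vm1": 38, "vm2": 41, "vm3": 29}   # space written so far

    used = sum(vms_actual_gb.values())
    promised = PROMISED_PER_VM_GB * len(vms_actual_gb)

    print(f"Used now: {used} / {LUN_CAPACITY_GB} GB, promised: {promised} GB "
          f"(overcommitted by {promised - LUN_CAPACITY_GB} GB)")
    if used > 0.8 * LUN_CAPACITY_GB:
        print("WARNING: move a virtual machine off this LUN before it fills up")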

To address this issue and to enable failover clustering for the virtual machines, Microsoft IT next adopted a model of configuring just a single virtual machine per LUN. With this model, a LUN with 30 to 50 GB was dedicated to each virtual machine, with the option to give the virtual machines more space as required.

Microsoft IT has avoided using disk mount points, so the limiting factor for the number of virtual machines deployed on a host became the number of available drive letters on the host computers. In most cases, this meant not deploying more than 23 virtual machines on a host.
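
The 23-VM ceiling follows directly from the alphabet: with one LUN per virtual machine and no mount points, every virtual machine's storage needs its own drive letter on the host, and A, B and C are normally unavailable (floppy drives and the host's system volume). A quick sketch of that arithmetic:

    # 26 drive letters minus the ones a Hyper-V host cannot hand out to virtual
    # machine LUNs (A and B for floppy drives, C for the system volume) leaves
    # the 23-VM ceiling mentioned above.

    import string

    reserved = {"A", "B", "C"}
    available = [letter for letter in string.ascii_uppercase if letter not in reserved]

    print(len(available), "drive letters available:", "".join(available))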

  • Microsoft IT has achieved 99.95 percent availability for virtual machines running on Microsoft Virtual Server 2005 R2, and it anticipates that the availability will increase for virtual machines running on Hyper-V. Very few applications that have been deployed as virtual machines require a higher availability level (see the quick arithmetic after this list for what 99.95 percent means in practice).
  • With Windows Server 2008 failover clustering, an administrator must store each virtual machine on an individual LUN. Because an administrator must provide all cluster nodes with access to the same shared storage by using the same drive letters, 23 is the maximum number of virtual machines that can run in a failover cluster. Microsoft IT could work around this limitation by using mount points and virtual machine groupings, but it considers this configuration too complex to administer. Because of this limitation, Microsoft IT has adopted a standard of using only three nodes in a cluster, with the cluster configured to tolerate one node’s failure.
  • When virtual machines fail over in a Windows Server 2008 failover cluster, the cluster service with Hyper-V must save the virtual machine state, transfer the control of the shared storage to another cluster node, and restart the virtual machine from the saved state. Although this process takes only a few seconds, the virtual machine still is offline for that brief period. If an administrator has to restart all hosts in the failover cluster because of a security update installation, the virtual machines in the cluster have to be taken offline more than once. Therefore, Microsoft IT determined that highly available virtual machines could have more downtime than virtual machines deployed on stand-alone servers in the case of simple planned downtimes for host maintenance, such as applying software updates.
  • Because of the required brief outage every time a virtual machine is moved from one host to another, Microsoft IT found that coordinating the server update processes with virtual machine owners was difficult. Because one physical host could contain several virtual machines, Microsoft IT had to communicate with each of the virtual machine owners and coordinate host server maintenance with virtual machine maintenance.
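
To put the 99.95 percent figure from the first bullet into absolute terms, the allowed downtime per year is a one-line calculation (our own arithmetic, not a figure from the document):

    # 99.95 % availability expressed as downtime per year (simple arithmetic,
    # not a figure published in the TechNet document).

    availability = 0.9995
    minutes_per_year = 365 * 24 * 60

    downtime_min = (1 - availability) * minutes_per_year
    print(f"{availability:.2%} availability allows ~{downtime_min:.0f} minutes "
          f"(~{downtime_min / 60:.1f} hours) of downtime per year")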

Because of these issues, Microsoft IT has not deployed failover clustering as the default standard for virtual machines. Microsoft IT has deployed several three-node clusters and does provide this service for virtual machines running critical workloads. One of the places where Microsoft IT is using failover clustering for virtual machines is in some branch offices that do not have 24-hour support staff on site. In a data center where administrators always are available to react to host downtime, Microsoft IT has minimized the use of Hyper-V clustering…

The whole article is priceless and is highly recommended reading (thanks to Vinternals for the link).
For the lazy ones, Microsoft even published a webcast about this internal case study. The presenter is David Lef, a Technology Architect with Microsoft IT.

How VMware is really using ESX

As we said, despite its leadership, VMware has never revealed how it uses ESX and its other products internally.
The first time ever that the company disclosed details about its virtual infrastructure was in September 2008 at VMworld US. A refreshed presentation (DC35) was shown during VMworld Europe 2009 by Tayloe Stansbury, the company's CIO.

  • VMware has an internal VDI deployment of over 550 users, including members of most departments.
    The client configuration includes Wyse V10 thin clients, Dell 24” monitors (configured at 1920×1200 pixels with 15-bit color), keyboard and mouse.
    The server configuration runs on HP c7000 blade systems, EMC Clariion CX3-80 storage and Cisco 3020s switch modules for the HP blades.
    The entire infrastructure is powered by VMware Virtual Desktop Manager (VDM) 2.1 in the US and View 3.0 in Europe.
  • VMware has an internal virtualized mail server deployment serving 7800 mailboxes.
    The entire infrastructure is powered by 29 virtual machines (split across two data centers) running Microsoft Exchange 2007 Enterprise Edition: 22 of them host the mailboxes, while the other 7 act as Client Access Servers (CAS).
  • VMware virtualizes its entire ERP infrastructure except Oracle Real Application Clusters (RAC). 
  • 97% of the company servers are virtualized across one Tier 4 and two Tier 2 data centers.
    Just two applications remain unvirtualized (one is Oracle RAC).
    EMC DMX4 is the storage back end of choice for mission-critical applications; EMC CX3-80 is used everywhere else.
    The front-end servers of choice are HP c7000 blades across the board.
  • The average consolidation ratio is 10:1 for servers and 64:1 for VDI desktops (a quick sanity check follows this list).
  • Each administrator manages an average of 145 virtual machines.
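
As a quick sanity check on those ratios (our own arithmetic: only the 550-desktop figure comes from the VDI bullet above, while the server-VM fleet size below is purely hypothetical):

    # Rough host and administrator counts implied by the ratios above.
    # 550 desktops is from the VDI bullet; the 1,450 server VMs are a
    # hypothetical fleet size used only to exercise the other two ratios.

    import math

    vdi_desktops, vdi_ratio = 550, 64
    server_ratio, vms_per_admin = 10, 145

    print(f"~{math.ceil(vdi_desktops / vdi_ratio)} hosts "
          f"for {vdi_desktops} desktops at {vdi_ratio}:1")

    server_vms = 1450
    print(f"{server_vms} server VMs -> ~{math.ceil(server_vms / server_ratio)} hosts, "
          f"~{math.ceil(server_vms / vms_per_admin)} administrators")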

For the ones who cannot access the VMworld presentations (access requires a yearly subscription), VMware published a webcast about this internal case study.