Is over-capacity inevitable in cloud computing?

A lot of discussion is going on these days around performance issues that Amazon customers are experiencing with the Elastic Compute Cloud (EC2).

The discussion was triggered by Alan Williamson, a prominent voice in the Java community, who posted an interesting description of his three years of experience with EC2.
Williamson suggests that Amazon is allowing EC2 over-subscription to the point that the cloud is so crowded it generates serious latency on the internal network, which impacts the performance of any multi-tier application that spans multiple virtual machines.
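
To make the symptom concrete, here is a minimal sketch of how one might measure inter-tier latency from inside the cloud. The host, port, and sample count are hypothetical placeholders, not details from Williamson's setup:

    import socket
    import statistics
    import time

    # Hypothetical target: the internal address of a second-tier
    # instance. Host and port are placeholders for illustration.
    HOST = "10.0.0.12"
    PORT = 8080
    SAMPLES = 100

    def measure_rtt(host, port, samples):
        """Time TCP connection setup to a peer instance, in milliseconds."""
        rtts = []
        for _ in range(samples):
            start = time.perf_counter()
            with socket.create_connection((host, port), timeout=2.0):
                pass  # connect and tear down; the handshake time is the sample
            rtts.append((time.perf_counter() - start) * 1000.0)
            time.sleep(0.1)
        return rtts

    rtts = measure_rtt(HOST, PORT, SAMPLES)
    print("min %.2f ms, median %.2f ms, max %.2f ms"
          % (min(rtts), statistics.median(rtts), max(rtts)))

A wide gap between the median and the maximum over a long run is exactly the kind of variability that hurts chatty multi-tier applications.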

Another Amazon customer, David Mok, CTO at OleOle.com, suggests instead that the overall performance degradation stems from differences in the physical hardware beneath the cloud (specifically the CPU), differences that the cloud platform, Amazon's implementation of Xen, is unable to fully abstract away.
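
Mok's hypothesis is easy to probe from inside an instance: Xen guests typically see the host's CPU model string in /proc/cpuinfo, so a quick check like this sketch (illustrative only) shows which physical processor an instance actually landed on:

    # Xen does not hide the host's CPU model, so two instances of the
    # same type can report different physical processors.
    def cpu_model():
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
        return "unknown"

    print(cpu_model())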

Christopher Hoff, former Chief Security Architect at Unisys and now Director of Cloud and Virtualization Solutions at Cisco, jumps in to comment on the whole discussion.
His very interesting point is that over-subscription is perfectly normal, since modern networks are designed around that model, while over-capacity is an issue that cloud computing is going to face just as telecom networks already do.
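
A back-of-the-envelope calculation shows why over-subscription is the norm in network design; the figures below are hypothetical, not Amazon's actual topology:

    # Illustrative oversubscription arithmetic for a top-of-rack switch
    # with 48 x 1 Gb/s server ports and 4 x 10 Gb/s uplinks.
    downlink_gbps = 48 * 1    # aggregate capacity facing the servers
    uplink_gbps = 4 * 10      # aggregate capacity toward the core

    print("oversubscription ratio: %.1f:1" % (downlink_gbps / uplink_gbps))

The resulting 1.2:1 ratio is a bet that not all 48 servers transmit at line rate at once; over-capacity is what happens when that bet fails and the uplinks saturate.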

He goes on to highlight that, as of today, there is nothing like a throughput SLA in cloud computing because:

…Your virtual interface ultimately is bundled together in aggregate with other tenants colocated on the same physical host and competes for a share of pipe (usually one or more single or trunked 1Gb/s or 10Gb/s Ethernet.) Network traffic in terms of measurement, capacity planning and usage must take into consideration the facts that it is both asymmetric, suffers from variability in bucket size, and is very, very bursty.

This complicates things when you consider that at this point scaling out in CPU is easier to do than scaling out in the network. Add virtualization into the mix which drives big, flat, L2 networks as a design architecture layered with a control plane that is now (in the case of Cloud) mostly software driven, provisioned, orchestrated and implemented, and it’s no wonder that folks like Google, Amazon and Facebook are desperate for hugely dense, multi-terabit, wire speed L2 switching fabrics and could use 40 and 100Gb/s Ethernet today…
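
To see why bursty traffic makes a throughput SLA so hard to offer, consider a toy simulation (every parameter is invented for illustration) of tenants contending for a shared 1 Gb/s pipe:

    import random

    LINK_GBPS = 1.0    # the shared physical pipe
    TENANTS = 20       # virtual interfaces bundled onto that pipe
    BURST_PROB = 0.1   # fraction of time slots in which a tenant bursts
    BURST_GBPS = 0.5   # a tenant's demand during a burst
    SLOTS = 10_000

    random.seed(42)
    overloaded = 0
    for _ in range(SLOTS):
        demand = sum(BURST_GBPS for _ in range(TENANTS)
                     if random.random() < BURST_PROB)
        if demand > LINK_GBPS:
            overloaded += 1

    # Average demand is 20 * 0.1 * 0.5 = 1.0 Gb/s, exactly the link's
    # capacity, yet bursts still overload the pipe in roughly a third
    # of the slots.
    print("link overloaded in %.0f%% of slots" % (100.0 * overloaded / SLOTS))

Even when average demand matches capacity, burstiness produces frequent contention, which is why a per-tenant throughput guarantee is so hard to write down.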