Amazon Ran Out of Capacity
Posted by Bob Warfield on February 18, 2008
As I suggested in my original post on the topic, Amazon’s recent S3 outage was due to running out of capacity. Specifically, they ran out of authentication capacity. In part, the problem arose because Amazon wasn’t monitoring this particular part of its capacity envelope very well. High Contrast has the Amazon quote telling us that it was also due to just a few customers radically increasing their load on the system in an unpredictable way:
the surge was caused by at least one very large customer plus several other customers suddenly and unexpectedly increasing their usage.
So far, most of the pundits are in something of a denial mode. They argue that nothing really new and interesting is happening here. All services go down, including the electric company. Vinnie Mirchandani says corporate data centers have been going down a lot more often than 99.999% uptime allows for since forever. Folks like Nick Carr seem to feel the biggest issue in this outage was that users didn’t have timely information and Amazon is fixing that.
This all misses a bigger point. What these writers are doing is attempting to apply the old standards and methods against the new world of Cloud Computing. The trouble is, there is something genuinely new at work here that goes beyond the inevitability of some outages and the need to be more transparent with customers about what is going on. The problem Amazon and other would-be cloud platform purveyors face is predictability. The world they deal in is radically less predictable than corporate data centers of old because the Internet today has much lower friction and higher connectivity between different web sites that make load spikes increasingly sudden and intense. There is a cascade of dominoes effect that is enabled by the low friction web that wasn’t nearly so twitchy in the past.
The premise of any large computing infrastructure is that by sharing the load across many customers (and in Amazon’s case, sharing excess capacity from their core retail business), we enable headroom for such load spikes. But how realistic is that concept?
Consider this Alexa plot of CNN and Flickr traffic over time:
Do these two curves look predictable to you? Take CNN, for example. Handling the big spikes requires 2-3x overload capacity. Flickr is a little less crazy except for one massive event that involved a doubling in a very short time. This latter event was permanent in its effect, so if you were counting on temporarily borrowing some headroom, you would have had to keep it in place indefinitely and grow from there. Ironically, that chart was brought to my attention at Amazon Startup Project, where they used it to sell the idea that by using Amazon Web Services a startup gets unlimited headroom it couldn’t afford to purchase on its own.
These charts are displaying non-linear behaviour, the hardest of all phenomena to predict. This non-linearity is becoming more and more common because the Internet has become extremely viral. It is crosslinked, the very meaning of the word “web”, and messages travel along the links with almost no friction. Viral has become a virtue, and much of the current innovation is focused around how to make the viral spread of information more likely. Social Networks are all about such behaviour. Take a look again at those CNN spikes. Now let’s imagine your cloud computing infrastructure is hosting a bunch of different blogging, micro-blogging, video, photo sharing, and other social sites. The CNN spikes no doubt represent something newsworthy happening. The greatest likelihood is that each spike will be echoed at some level across all of these sites that are in the business of spreading information. Friction has been lowered to the point it is almost non-existent when it comes to the spread of memes on the Internet. We have major spikes from world events, such as the assassination of a world leader. On the Internet, we can also have major spikes from such inane moments as Scoble shedding tears of delight over new Microsoft secret software. And the whole thing is wired together. That one tear on Scoble’s cheek breeds a thousand or more accounts ranging from poking fun to trying to guess what this secret software is. There is a ravenous beast poised over the keyboard waiting for something interesting to pass on to its network of other ravenous beasts.
This is decidedly non-linear behaviour and impossible to predict. The answer is that major cloud computing infrastructure providers will need to have considerable excess capacity available on tap at all times to avoid outages. Take Amazon. Web bandwidth to their web services now exceeds the total traffic to all of their other properties. What might have once been a nice remaindering business allowing them to resell their excess capacity is now driving the need for more capacity. They have just a few choices. They can invest in a lot more hardware and lower the margins on their business, or they can implement some strategies to limit the availability of the service to some customers. It strains credulity to think they’ll limit capacity to their retail business. How will they decide? Tiered pricing of some kind?
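To make the headroom argument concrete, here is a minimal sketch of why sharing capacity only helps when customers spike at different times. The load numbers are invented for illustration (three hypothetical customers with a baseline of 10 units and one spike to 30 each); they are not measurements from Amazon or from the Alexa chart.

```python
def pooled_headroom(loads):
    """Compare capacity needs for separate vs. shared infrastructure.

    loads: list of per-customer load series, all the same length.
    Returns (sum_of_peaks, peak_of_sum):
      sum_of_peaks -- capacity needed if every customer provisions alone
      peak_of_sum  -- capacity needed if they all share one pool
    """
    sum_of_peaks = sum(max(series) for series in loads)
    peak_of_sum = max(sum(step) for step in zip(*loads))
    return sum_of_peaks, peak_of_sum


# Hypothetical data: each customer spikes at a different time.
independent = [
    [30, 10, 10, 10],
    [10, 30, 10, 10],
    [10, 10, 30, 10],
]

# Hypothetical data: one news event makes everyone spike at once.
correlated = [
    [10, 30, 10, 10],
    [10, 30, 10, 10],
    [10, 30, 10, 10],
]

print(pooled_headroom(independent))  # (90, 50)
print(pooled_headroom(correlated))   # (90, 90)
```

With independent spikes the shared pool needs 50 units instead of 90. When the spikes line up, the pool needs the full 90 and the sharing benefit disappears, which is the situation a viral, crosslinked web tends to create.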
Think in terms of other unexpected networked events. I’m reminded of financial markets and the law of unintended consequences. Look at today’s housing market. Remember Long-Term Capital Management, a hedge fund with Nobel laureates who had mathematical proofs they would continue making money. Right up until they unpredictably collapsed. BTW, this sort of thing used to happen with the electrical grid too. In both cases, the financial markets and the electrical grid, elaborate means were put into place to artificially inject friction to damp the machine’s oscillations before it could destroy itself. There are elaborate rules in the stock exchanges about shorting stocks that are falling. They inject a form of friction back into those markets to prevent total free fall.
Perhaps this points the way to new technology for Cloud Computing infrastructure. A gentle injection of the right kind of friction at the right point for a limited time might prevent sudden, massive spikes and outages. It’s an area ripe for innovation. Meanwhile, Amazon could sorely use some competition. If a customer could contract for emergency capacity from elsewhere, or even better, if the Cloud Computing Providers could share slack capacity as the electrical companies do, it would be tremendously helpful when the inevitable load spikes arrive.
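As one illustration of what "injecting friction" could look like in software, here is a minimal sketch of a token bucket, a well-known rate-limiting technique. This is my own example of the general idea, not something Amazon has said it uses, and the class name, the parameters, and the numbers are all assumptions chosen for illustration. Time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allow short bursts but cap the sustained request rate."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per second (sustained rate)
        self.burst = burst    # maximum tokens the bucket can hold
        self.tokens = burst   # start full so normal traffic is unaffected
        self.last = 0.0       # time of the previous call, in seconds

    def allow(self, now, cost=1.0):
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical customer limited to 1 request/second with a burst of 2.
bucket = TokenBucket(rate=1.0, burst=2.0)
spike = [bucket.allow(now=0.0) for _ in range(3)]
later = bucket.allow(now=1.0)
print(spike, later)  # [True, True, False] True
```

The third request in the sudden spike is refused, but a second later the customer is served again. The friction is applied only during the surge and only to the party causing it, which is roughly the "right point for a limited time" property described above.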