What to Do When Your Cloud is Down
Posted by Bob Warfield on April 21, 2011
This post is on behalf of the Enterprise CIO Forum and HP.
As I write this, Amazon is having a major East Coast outage that has affected Heroku, Foursquare, Quora, Reddit and others. Heroku’s status page is just the sound of a lost sheep bleating repeatedly for its mother in heavy fog. What’s a poor sheep to do about this problem anyway? After all, isn’t a Cloud-based service dead once its Cloud is dead?
Rather than wringing our hands and shaking our heads about “That Darned Cloud, I knew this would happen”, let’s talk about it a bit, because there are some things that can and should be done. Enterprises wanting to adopt the Cloud will want to have thought through these issues and not just avoided them by avoiding the Cloud. In the end, they’re issues every IT group faces with their own infrastructure and there are strategies that can be used to minimize the damage.
I remember a conversation with a customer when I was a young Vice President of R&D at Borland, then a half a billion dollar a year software company (I miss it). This particular customer was waxing eloquent about our Quattro Pro spreadsheet, but they just had one problem they wanted us to solve: they wanted Quattro Pro not to lose any data if the user was editing and there was a power outage.
I was flabbergasted. “It’s a darned computer, it dies when you shut off the power!” I sputtered in only slightly more professional terms. Of course I was wrong and hadn’t really thought the problem through. With suitable checkpoints and logging, this is actually a fairly straightforward problem to solve and most of the software I use today deals with it just fine, thank you very much.
So it is with the Cloud. Your first reaction may be, “We’re a Cloud Service, of course we go down if our Cloud goes down!” But, it isn’t that black and white. I like John Dodge’s thought that the Cloud should be treated just like rubber, sugar, and steel. When Goodyear first started buying rubber from others, when Ford bought steel, and when Hershey’s bought sugar, do you think they didn’t take steps to ensure their suppliers wouldn’t control them? Or take Apple. Reports are that Japan’s recent tragedies aren’t impacting them much at all and that they’re absolutely sticking with their Japanese suppliers. This has to come down to Apple and their suppliers having had a plan in place that was robust enough to weather even a disaster of these proportions.
What can be done?
First, this particular Amazon outage is apparently a regional outage, limited to the Virginia datacenter. A look at Amazon’s status as I write this shows the West Coast infrastructure is doing okay.
Most SaaS companies have to get huge before they can afford multiple physical data centers if they own the data centers. But if you’re using a Cloud that offers multiple physical locations, you can get the extra security of multiple physical data centers very cheaply. The trick is, you have to make use of it, but it’s just software. A service like Heroku could’ve decided to spread the applications it’s hosting evenly over the two regions or gone even further afield to offshore regions.
This is one of the dark sides of multitenancy, and an unnecessary one at that. Architects should be designing not for one single super apartment for all tenants, but for relatively few apartments, with the operational flexibility to allocate tenants automatically, via dashboard, to whatever apartments they like, and then change their minds and seamlessly migrate them to new accommodations as needed. This is a powerful tool that ultimately will make it easier to scale the software too, assuming its usage is decomposable enough to minimize communication between the apartments. Some apps (Twitter!) are not so easily decomposed.
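To make the apartment idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Placement class, the apartment and region names, and the least-loaded placement rule are hypothetical, not how Heroku or Amazon actually do it. It only tracks who lives where; moving the tenant's data is the hard part and is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Placement:
    """Hypothetical tenant-to-apartment map (illustrative names only)."""
    apartments: dict                              # apartment name -> region
    tenants: dict = field(default_factory=dict)   # tenant name -> apartment

    def _least_loaded(self, candidates):
        # Count tenants per candidate apartment; ties go to the
        # alphabetically first name so placement is deterministic.
        load = {name: 0 for name in candidates}
        for apartment in self.tenants.values():
            if apartment in load:
                load[apartment] += 1
        return min(sorted(load), key=load.__getitem__)

    def assign(self, tenant):
        """Place a new tenant in the least-loaded apartment."""
        target = self._least_loaded(self.apartments)
        self.tenants[tenant] = target
        return target

    def migrate(self, tenant, apartment):
        """Move one tenant to a named apartment (bookkeeping only)."""
        if apartment not in self.apartments:
            raise KeyError(apartment)
        self.tenants[tenant] = apartment

    def evacuate(self, region):
        """Move every tenant out of apartments in a failed region."""
        healthy = [a for a, r in self.apartments.items() if r != region]
        if not healthy:
            raise RuntimeError("no healthy apartment left")
        moved = []
        for tenant, apartment in sorted(self.tenants.items()):
            if self.apartments[apartment] == region:
                self.tenants[tenant] = self._least_loaded(healthy)
                moved.append(tenant)
        return moved
```

With two apartments, one per region, new tenants alternate between them, and evacuate("us-east") relocates only the tenants who were living in the failed region.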
This, then, is a pretty basic question to ask of your infrastructure provider: “How easy do you make it for me to access multiple physical data centers with attendant failover and backups?” In this case, Amazon offers the capability, but Heroku took it away again for customers who put Heroku in their stack. I suspect they’ll address this issue pretty shortly, but it would’ve been a good question to explore earlier, no? Meanwhile, what about the other vendors you may be using that build on top of Amazon? Do they make it easy to spread things around and not get taken out if one Amazon region goes down? If not, why not?
Here’s the answer you’d like to hear:
We take full advantage of Amazon’s multiple regions. We’ll make it easy if one goes down for your app to be up and running on the other within an SLA of X.
Note that they may charge you extra for that service and it may therefore be optional, but at least you’ve made an informed choice. Certainly all the necessary underpinnings are available from Amazon to support it. There are some operational niceties I won’t get into too deeply here, but I do want to mention in passing that there is a continuum of possible answers to the above question, and they differ in the SLA. For example, at my last startup, we were in the Cloud as a Customer Service app. We decided that if the region we were in failed totally, we wanted to be able to bring the service back in another region within 20 minutes and with no more than 5 minutes of data loss. That pretty much dictated how we needed to use S3 (which is slow, but automatically ships your data to multiple physical data centers), EBS, and EC2 to deliver those SLAs. Smart users and PaaS vendors will look into packaging several options. You should be backed up to S3 regardless, so what you’re basically arguing about and paying extra for is how “warm” the alternate site is and how much has to be spun up from scratch via S3.
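Here is a rough sketch of how recovery targets like those narrow the choice of standby. The 20 minute and 5 minute targets come from the example above; the three options, their timings, and their costs are invented purely for illustration and are not measurements of any real setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandbyOption:
    name: str
    backup_interval_min: float  # how often data is shipped to the other region
    restore_min: float          # time to bring the service up there
    relative_cost: float        # made-up cost units


def meets(option, recovery_min, data_loss_min):
    # Worst case, the failure hits just before the next backup,
    # so the data lost equals one full backup interval.
    return (option.restore_min <= recovery_min
            and option.backup_interval_min <= data_loss_min)


def cheapest(options, recovery_min, data_loss_min):
    """Return the lowest-cost option that meets both targets, or None."""
    ok = [o for o in options if meets(o, recovery_min, data_loss_min)]
    return min(ok, key=lambda o: o.relative_cost) if ok else None


# Hypothetical options; every number here is an assumption.
OPTIONS = [
    StandbyOption("cold: rebuild everything from S3", 60, 120, 1.0),
    StandbyOption("warm: images ready, data synced every 5 min", 5, 15, 3.0),
    StandbyOption("hot: live replica in a second region", 1, 2, 8.0),
]

choice = cheapest(OPTIONS, recovery_min=20, data_loss_min=5)
```

Under these made-up numbers the warm option is the cheapest one that fits. Tighten either target far enough and only the hot replica qualifies, or nothing does.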
Another observation about this outage: it is largely focused on EBS latency, though there is also talk of difficulty connecting to some EC2 instances. This is the second time in recent history we’ve heard of some major EBS issues. We read that Reddit had gone down over EBS latency issues less than a month ago. Clearly anyone using EBS needs to be thinking about failure as a likely possibility. In fact, the ReadWriteWeb article I linked to implies Reddit had been seeing EBS problems for quite some time. One wonders if Heroku has too.
What will you do if you’re using EBS and it fails? Reddit says they’re rearchitecting to avoid EBS. That’s certainly one approach, but there may be others. Amazon provides considerable flexibility in the combination of local disk, EBS, and S3 to fashion alternatives. The trick is in making your infrastructure sufficiently metadata driven, and sufficiently well-tested because you have thought through the scenarios and tried them, that you can adapt in real time when problems develop. In this respect, I have seen Netflix admonish that the only way to test is to keep taking down aspects of your production infrastructure and making sure the system adapts properly. That’s likely another good question to ask your PaaS and Cloud vendors: “Do you take down production infrastructure to test your failover?” Of course you’d like to see that and not just take their word for it too.
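A toy version of that kind of drill might look like the sketch below. It is in-memory only and makes no real AWS calls. The Backend and Store classes and the drill function are hypothetical, and the names "ebs", "local", and "s3" are just labels on fake backends.

```python
import random


class Backend:
    """Fake storage backend with an on/off switch (no real I/O)."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(self.name)
        self.data[key] = value

    def read(self, key):
        if not self.up:
            raise ConnectionError(self.name)
        return self.data[key]


class Store:
    """Writes to every healthy backend, reads from the first that answers."""

    def __init__(self, backends):
        self.backends = backends

    def write(self, key, value):
        written = 0
        for backend in self.backends:
            try:
                backend.write(key, value)
                written += 1
            except ConnectionError:
                pass
        if written == 0:
            raise ConnectionError("all backends down")

    def read(self, key):
        for backend in self.backends:
            try:
                return backend.read(key)
            except (ConnectionError, KeyError):
                continue
        raise ConnectionError("no backend could serve " + key)


def drill(store, rng):
    """Take one backend down at random and check that reads survive."""
    store.write("probe", "ok")
    victim = rng.choice(store.backends)
    victim.up = False
    try:
        return store.read("probe") == "ok"
    except ConnectionError:
        return False
    finally:
        victim.up = True  # always restore the backend after the drill


store = Store([Backend("ebs"), Backend("local"), Backend("s3")])
survived = drill(store, random.Random())
```

A store with three backends survives losing any one of them, while a store with a single backend fails the drill every time. A real drill would also have to cover what this toy ignores, such as backends that come back with stale data.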
I haven’t even touched on the possibilities of utilizing multiple Cloud vendors to ensure further redundancy and failover options. It would be fascinating to see a PaaS service like S3 that is redundant across multiple data centers and multiple cloud vendors. That seems like a real winner for building the kind of services that will be resilient to these kinds of outages. It’s early days yet for the Cloud, even though some days it seems like Amazon has won. There’s plenty of opportunity for innovators to create new solutions that avoid the problems we see today. Even the experts like Heroku aren’t utilizing the Cloud as well as they should be.
Now is your chance!
James Cohen has some good thoughts on how to work around Amazon outages.
I tweeted: “The beauty of Cloud: We can blame Amazon instead of our IT when we’re down. Except we really can’t: http://tinyurl.com/3hjhzr5”
Excellent discussion here about how Netflix has a ton of assets on AWS and was unaffected. In their words, they run on 3 regions and architected so that losing 1 would leave them running. As Netflix says, “It’s cheaper than the cost of being down.” Amen. I’m seeing some anonymous posts whining about the exact definition of zones versus regions, what’s a poor EU service to do, etc., etc. Study Netflix. They’re up. These other services are down. Oh, and forget the anonymous comments. Give your name like a real person and don’t be a lightweight.
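For a sense of what “losing 1 would leave them running” costs in capacity, here is some back-of-the-envelope arithmetic in Python. It assumes load spreads evenly over whatever locations survive, and the instance counts are made up, not Netflix’s actual figures.

```python
import math


def instances_per_location(peak_instances, locations, tolerated_failures=1):
    """Instances each location needs so the survivors can carry peak load."""
    survivors = locations - tolerated_failures
    if survivors < 1:
        raise ValueError("nothing left to serve traffic")
    return math.ceil(peak_instances / survivors)


def overhead(locations, tolerated_failures=1):
    """Extra capacity, as a fraction of peak, bought to survive the loss."""
    return locations / (locations - tolerated_failures) - 1


# Hypothetical service that needs 100 instances at peak.
per_location = instances_per_location(100, locations=3)
total = 3 * per_location
```

With 3 locations, each one runs 50 instances, 150 in all, which is 50% more than the bare minimum. With only 2 locations, each must carry the full 100, so the overhead is 100%. Spreading wider makes the insurance cheaper.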
Lots of comments here and there also that multi-cloud redundancy is hard. Aside from the fact that this particular incident today didn’t require multiple clouds, consider that it is fantastically easier to go multi-cloud than it is to build multiple physical data centers. Salesforce.com was almost a billion dollar a year company before they built a second data center. Speaking of which, I bet they want to chat with the folks at Heroku now that they own them.
Clay Loveless gets failover in the Cloud. JustinB, not so much. Too ready to take Amazon’s word about their features. Makes me wonder if folks early to AWS who saw it buggy and got used to dealing with that are better able to deal with problems like today’s? When you run a service, it’s your problem, even when your Cloud vendor fails. Gotta figure it out.
Lydia Leong (Gartner) and I don’t always agree, but she’s spot on with her analysis of the Amazon “failure” and what customers should have been doing about it.
EngineYard apparently was set to offer multiple AWS regions in Beta, and accelerated it to mitigate AWS problems for their customers. Read their Twitter log on it. Would love to hear from some of their customers that tried it how well it worked.