You don’t have to be a techie to have heard about the challenges Amazon faced with its Elastic Compute Cloud (EC2) offering over the weekend. States of emergency were declared up and down the Eastern Seaboard after a destructive line of thunderstorms knocked out power to more than 4 million homes, as well as to Amazon’s East Coast cloud data center. The outage took down Netflix, Pinterest, Instagram and other well-known cloud applications, and cut off access to countless businesses whose names we might never know.
Since then, media and social media have been speculating on what this event says about the future of the cloud. The speculation culminated on Sunday in a New York Times article by Quentin Hardy, in which he states that the outage “underlined how businesses and consumers are increasingly exposed to unforeseen risks and wrenching disruptions as they increasingly embrace life in the cloud.”
Seriously? This kind of sensational reporting is certainly great link bait, but it’s flat-out inaccurate. Don’t get me wrong. I’m happy to elaborate on the challenges of hosting mission-critical applications in the public cloud. But treating a regional disaster as a referendum on cloud risk shows a lack of understanding of how the cloud actually works.
First of all, a state of emergency was actually declared in that region. And while one should ask what happened to Amazon’s failover power procedures, there is no question that Amazon runs data centers better than the average business runs its in-house data room. So while customers should rightly ask tough questions of their provider, what’s the alternative?
Would these businesses have been better served by a backup power supply in an on-site data center in a small office park in Northern Virginia? The reality, as Hardy of The New York Times alludes to, is that most people in the eastern U.S. could post pictures to Instagram of their kids playing Monopoly by candlelight (from their iPhones) long before they could actually turn the lights on in their homes. Amazon has to be doing something right.
Second, these businesses (Netflix, Instagram, Pinterest) likely chose to rely on one data center. One of the biggest misconceptions about the cloud is that buying one server from Amazon means it automatically fails over between different geographic locations. That is just not the case.
Amazon offers high availability and equipment redundancy within an “availability zone” (location) that is far better than what most customers can afford to put in their own offices. But this product does not span locations. If you want to ensure that an application survives a regional disaster, you need to buy servers in more than one location.
That means server and resource requirements times two or three, depending on the cost and your appetite for risk. The business then needs to ensure that the application is prepared to be load balanced in that way, that the data is replicated between sites, and that the customer experience is not affected. For the techies reading this post, you know this isn’t easy.
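To put rough numbers on that trade-off, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption made for illustration: the 99% per-site availability, the $1,000 monthly cost per site, and the function names are all hypothetical, and none of them are Amazon’s figures. It also assumes that sites fail independently of one another, which is optimistic when a single storm can hit a whole region.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_site: float, sites: int) -> float:
    """Chance that at least one site is up, ASSUMING independent failures."""
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # The service is down only when every site is down at the same time.
    return 1.0 - (1.0 - per_site) ** sites


def expected_downtime_hours(per_site: float, sites: int) -> float:
    """Expected hours per year with no site available."""
    return (1.0 - combined_availability(per_site, sites)) * HOURS_PER_YEAR


def compare(per_site: float, monthly_cost_per_site: float, max_sites: int = 3):
    """Return (sites, monthly cost, expected downtime hours) for 1..max_sites."""
    return [
        (n, monthly_cost_per_site * n, expected_downtime_hours(per_site, n))
        for n in range(1, max_sites + 1)
    ]


if __name__ == "__main__":
    # Hypothetical inputs: 99% availability per site, $1,000 per site per month.
    for sites, cost, downtime in compare(0.99, 1000.0):
        print(f"{sites} site(s): ${cost:,.0f}/month, ~{downtime:.2f} h/year down")
```

The cost grows linearly with every site you add, while the downtime you buy back shrinks with each one. Whether the second or third site is worth it depends on what an hour of downtime actually costs the business.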
You might be asking yourself how these businesses could have been so shortsighted. Are they just being “cheap,” as Michael Lee suggested on ZDNet?
They’re not being cheap. They’re smart business people. Instagram and Pinterest are free services provided to end users. They’re not managing mission-critical environments such as the International Space Station, heart-lung machines or emergency response applications. People couldn’t post a picture of their dinner for a few hours? I’m sorry, but that’s not exactly a cloud computing crisis. Businesses make decisions about disaster recovery and continuity based on business risk.
With a free product, there is no short-term revenue risk. They’re already doing a better job for their customers by hosting with Amazon than they would by doing it themselves. A single-data-center design with a redundant and scalable architecture, run by the company credited with inventing the public cloud, is a reasonable amount of continuity planning.
And even if they had been hosted in multiple availability zones, an outage of this size is large enough to cause cascading problems, not only in their applications and data integrity during failover, but also across the Internet in general.
We might never know the details about everything that transpired at Amazon last weekend or about the standards in Instagram’s continuity plan.
But we do know one thing: The storms of last weekend should be viewed as validation for cloud computing, not an indictment of it.