Saturday, July 26, 2008

Cloud Computing and High Availability

Last week the fail whale, the symbol that has become associated with Twitter’s recurring outages, swam across the North Pacific and hit Amazon’s S3 service. I am talking about the already widely discussed S3 outage. It is fair to say that the services dependent on S3 (e.g. Polyvore.com) really felt the “business and user impact” of the outage. Did the users of those dependent services care that those services were using S3 to save costs? Of course they did not. The dependent services wrote apologetic blog entries, and the never-ending debates about the pros and cons of cloud computing started yet again. But I won’t bore you with yet another synopsis of the outage.


UPDATE: Yesterday, Amazon did a great job of being transparent about the issue that caused the outage.


However, as a technology product leader who also runs a software-as-a-service product at IBM, I am always faced with new challenges related to the shared application code base and, more importantly, the shared application infrastructure. It is a no-brainer that specialized services (e.g. Amazon’s S3) can do a better job at lower cost than individual internal IT services can. But at the same time, most people forget that the more clients a cloud-based service gets, the more widely an outage is felt. Therefore, as usage of the service increases, the tolerance for failure goes to zero and uptime expectations go through the roof. Loosely, we can represent it as: cloud computing uptime expectations = number of clients x cost of the service. Amazon’s S3 had an outage. But is that an anomaly? No. If your answer is yes, you have never run a large-scale system. However, the impact of the S3 outage was unbearable to most of its clients. Again, please keep in mind that cloud-based storage means nothing to the users of FriendFeed, Twitter or Polyvore.com; they only see that the service they use is down.


I am a big proponent of both infrastructure cloud computing services and software-as-a-service applications. However, this S3 outage got me thinking about how we as an industry could come up with a solution. We know that no matter how much redundancy a distributed cloud-based system has, some day something will break. So the obvious armchair architect’s solution of adding redundant disks, redundant servers, an “unbreakable” distributed system design and other infrastructure elements just won’t prevent another outage.


I think one possible solution could be interoperability among cloud-based infrastructure services. The concept is analogous to the SMTP and POP protocols for email services. Let’s take online storage as an example. Amazon S3 and participating competitors would agree on a standard API to store and retrieve data in the cloud. Users would initially select a service based on their own criteria. S3 and its competitors could offer a redundant cloud storage feature as “extra insurance” at the time of sign-up. With this feature, users could choose a competitor’s cloud as a “redundant” cloud to be used in case the selected company’s cloud fails.
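To make the idea a little more concrete, here is a minimal Python sketch of what a client of such a standard API might look like. Everything in it is an assumption made for illustration: the `BlobStore` interface, the `ProviderUnavailable` error, the in-memory stand-in providers and the `RedundantStore` wrapper are hypothetical names, not Amazon’s S3 API or any existing standard.

```python
from typing import Dict, Protocol


class ProviderUnavailable(Exception):
    """Hypothetical error a provider raises while its cloud is down."""


class BlobStore(Protocol):
    """Hypothetical 'standard API' every participating provider would implement."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for one provider's cloud; not a real storage client."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True  # flip to False to simulate an outage
        self._blobs: Dict[str, bytes] = {}

    def _check(self) -> None:
        if not self.available:
            raise ProviderUnavailable(self.name)

    def put(self, key: str, data: bytes) -> None:
        self._check()
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        self._check()
        return self._blobs[key]


class RedundantStore:
    """Writes to both clouds and reads from the secondary if the primary fails."""

    def __init__(self, primary: BlobStore, secondary: BlobStore) -> None:
        self.primary = primary
        self.secondary = secondary

    def put(self, key: str, data: bytes) -> None:
        stored = 0
        for store in (self.primary, self.secondary):
            try:
                store.put(key, data)
                stored += 1
            except ProviderUnavailable:
                continue  # the other provider may still accept the write
        if stored == 0:
            raise ProviderUnavailable("both providers are down")

    def get(self, key: str) -> bytes:
        try:
            return self.primary.get(key)
        except (ProviderUnavailable, KeyError):
            # Primary is down, or missed this write during an earlier outage.
            return self.secondary.get(key)


# Usage: store an object, take the primary down, and read it back anyway.
primary = InMemoryStore("primary-cloud")
secondary = InMemoryStore("redundant-cloud")
store = RedundantStore(primary, secondary)
store.put("photo.jpg", b"image-bytes")
primary.available = False
recovered = store.get("photo.jpg")
```

The sketch leaves out the hard parts on purpose: re-synchronizing the primary after it recovers, consistency between the two copies, and authentication across providers would all have to be part of any real standard.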


Now, this solution has not gone through any deep analysis and is more of a random thought. But I do wonder about the other factors that could play into it. The companies would have to compete hard to keep their customers, who would be one click away from switching to the competitor and perhaps making the original provider the “redundant” cloud. Another factor is how one would price the service of being the redundant cloud. X% of the primary service’s price, plus full charges while the primary provider is down? Also, what would be the economic advantage for the companies that interoperate with each other versus the ones that don’t cooperate? Open source foundations (e.g. the Apache Software Foundation) have pioneered standardization among a lot of locally installed software. Will we need a similar foundation to manage the interoperability of cloud-based services?
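The pricing question above can be turned into a small back-of-the-envelope calculation. The sketch below is one possible reading of “X% of the primary service and full charges during the failure”, under assumptions of my own: the redundant provider’s full rate equals the primary’s monthly fee, charges are prorated by the hour, and a month has 720 hours. The function name and parameters are hypothetical.

```python
def redundant_cloud_cost(
    primary_monthly_fee: float,
    standby_fraction: float,
    outage_hours: float,
    hours_in_month: float = 720.0,
) -> float:
    """Monthly charge from the redundant provider under the assumed model.

    While the primary is up, the redundant provider charges
    standby_fraction (the "X%") of the primary's fee. While the primary
    is down, it charges the full fee, prorated by the hour.
    """
    if not 0.0 <= standby_fraction <= 1.0:
        raise ValueError("standby_fraction must be between 0 and 1")
    if not 0.0 <= outage_hours <= hours_in_month:
        raise ValueError("outage_hours must fit within the month")

    outage_share = outage_hours / hours_in_month
    standby_charge = primary_monthly_fee * standby_fraction * (1.0 - outage_share)
    failover_charge = primary_monthly_fee * outage_share
    return standby_charge + failover_charge


# Usage: a $100/month primary, a 20% standby rate, and a 72-hour outage.
example_cost = redundant_cloud_cost(100.0, 0.20, 72.0)
```

Under this model the redundant provider’s bill ranges from X% of the primary’s fee in a month with no outage up to the full fee if the primary is down all month; whether that is attractive to either side is exactly the open question.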

2 comments:

Anonymous said...

Hi,

I just posted an article on 'Cloud Availability' at the following URL, would like to hear your comments on the same.
http://mukulblog.blogspot.com/2008/07/cloud-availability.html

Thanks,
Mukul.

Anonymous said...

great work for me.
thanx