On the Reliability of Cloud Computing

Originally published April 27, 2011

Last week’s outage of the Amazon Web Services (AWS) EC2 service continues to make headlines. The high-visibility outage impacted many companies that depend on Amazon for either primary or overflow computing and storage services. Impacting a large number of databases, applications, and sites across the web, the outage shows just how pervasive the use of cloud computing has become.

At Ajilitee, we have been using AWS and other cloud services for nearly three years. In fact, we run our business infrastructure entirely in the cloud and manage a hosted application for clients on a cloud-based platform. Notwithstanding the recent problem, the availability and reliability of these services have been very good. So good that I can understand how some may have been lulled into a false sense of security. After all, Amazon’s out-of-the-box promise of 99.95% uptime is indeed respectable.

So what went wrong and what are the implications of this failure? Will it create so much fear, uncertainty and doubt (the FUD factor) that it stalls the widespread adoption of cloud computing across industry, much the same way the Three Mile Island incident squelched the expansion of nuclear energy in the U.S.?

As for what went wrong, much has been written about the failure of Amazon’s multiple Availability Zones in its eastern region, so I won’t cover it again here. For the details, I refer you to Lydia Leong, a Gartner Research VP and an early blogger on the outage.

As for the impact of FUD on cloud adoption, I think there will be a few more rounds of hand-wringing on the risks of cloud, but in the end, a pragmatic perspective will prevail. After all, whether your applications are hosted in the cloud, delivered through your own data center, or managed by an outsourcer in a remote location, outages and downtime will occur. Such failures happen in all environments. That’s why we have disaster recovery plans! For cloud-based applications, one must simply have a disaster plan that is architected for the cloud environment. Creating such a plan entails weighing the risks versus the rewards of various decisions. Let me explain what I mean.

Whitebox versus Blackbox

There are two approaches to exposing the implementation details of services. Blackbox services shield consumers from the implementation details; whitebox services provide transparency and visibility into one or more layers of implementation. Regardless of the approach, architects need to understand the potential points of failure and weigh the risks versus the costs of prevention.
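To make the whitebox idea concrete, here is a minimal sketch, assuming Python and the boto3 AWS SDK (tooling that post-dates this article) with illustrative region settings. EC2 exposes instance placement, so an architect can verify one potential point of failure: whether all running instances sit in a single Availability Zone.

  # "Whitebox" visibility sketch: check the Availability Zone spread of
  # running EC2 instances. Assumes boto3 and an illustrative region.
  from collections import Counter

  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")   # region is an assumption
  zones = Counter()

  paginator = ec2.get_paginator("describe_instances")
  for page in paginator.paginate(
      Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
  ):
      for reservation in page["Reservations"]:
          for instance in reservation["Instances"]:
              zones[instance["Placement"]["AvailabilityZone"]] += 1

  print(dict(zones))
  if len(zones) < 2:
      print("Warning: every running instance is in a single Availability Zone.")

A blackbox service, by contrast, exposes no placement information at all, so the architect must rely on the provider’s availability commitments instead of verifying them.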

For example, many data warehouse implementations are not architected and engineered for high availability because the businesses cannot justify the cost of maintaining full redundancy. The implication is that a major storage failure could impact data warehouse service levels for days or even weeks.

Will you apply different availability standards to a data warehouse hosted in the cloud as opposed to one run out of your own data center? You may well do so, because your decision to use the cloud may include a goal of improved service levels to the business through better availability options. But do examine your thinking and what is really necessary for your recovery operations.

Two important aspects to consider in a data warehouse recovery scenario are:
  1. How long can the sources or staging processes continue during a failure scenario without losing data?

  2. Once operation is restored, how much spare capacity is required and how long will it take for the data warehouse systems to catch up? (A rough calculation is sketched below.)
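The second question lends itself to simple arithmetic. The sketch below uses purely hypothetical numbers for outage duration, data arrival rate, and load throughput; the point is that catch-up time is governed by spare capacity, not total capacity.

  # Back-of-the-envelope catch-up estimate. All numbers are hypothetical.
  outage_hours = 36                  # how long loads were halted
  arrival_rate_gb_per_hr = 40        # data the sources keep producing per hour
  load_rate_gb_per_hr = 50           # sustained ETL/load throughput

  backlog_gb = outage_hours * arrival_rate_gb_per_hr

  # Only the spare throughput (load rate minus ongoing arrivals) works
  # down the backlog once operation is restored.
  spare_rate_gb_per_hr = load_rate_gb_per_hr - arrival_rate_gb_per_hr
  catch_up_hours = backlog_gb / spare_rate_gb_per_hr

  print(f"Backlog after the outage: {backlog_gb} GB")
  print(f"Estimated catch-up time: {catch_up_hours:.0f} hours")
  # With only 10 GB/hr to spare, a 36-hour outage takes about 144 hours
  # (six days) to clear, which is why spare capacity belongs in the plan.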

Interdependency

Service level and availability management are two key dimensions of data warehouse and business intelligence (BI) systems that often are not well understood or executed even in legacy data centers. One contributing factor is the often complex web of interdependencies between the data sources, BI systems, and data consumers.

The same principle holds for cloud services – the more abstracted or higher-level a service, the more complex the implementation and the greater the number of interdependencies that must be considered when engineering it. Today’s sophisticated managed analytics environments combine multiple servers and services, including ETL, data quality, database, analytics tools, applications, and portal-based web services, all of which must be coordinated in a consistent configuration, and any one of which, if misconfigured, can impact the quality of service.
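One way to reason about that web of interdependencies is to model it explicitly as a directed graph and trace what sits downstream of a failed component. The sketch below uses hypothetical service names for a typical managed analytics stack.

  # Model the analytics stack's interdependencies as a directed graph and
  # trace everything downstream of a failed component. Service names are
  # hypothetical.
  from collections import deque

  # service -> services that consume it directly
  DEPENDENTS = {
      "source_feeds":    ["etl"],
      "etl":             ["data_quality"],
      "data_quality":    ["warehouse_db"],
      "warehouse_db":    ["analytics_tools", "applications"],
      "analytics_tools": ["portal"],
      "applications":    ["portal"],
      "portal":          [],
  }

  def impacted_by(failed_service):
      """Breadth-first walk of every service downstream of the failure."""
      seen, queue = set(), deque([failed_service])
      while queue:
          for consumer in DEPENDENTS.get(queue.popleft(), []):
              if consumer not in seen:
                  seen.add(consumer)
                  queue.append(consumer)
      return sorted(seen)

  print(impacted_by("etl"))
  # ['analytics_tools', 'applications', 'data_quality', 'portal', 'warehouse_db']

A misconfiguration anywhere on that path degrades everything downstream of it, which is why the configuration has to be managed as a single, consistent unit.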

Capacity Guarantees

When a single server fails and the cloud service has spare capacity, it is usually a simple process to instantiate a replacement server from backups. However, when many instances fail concurrently (or when the monitoring and control system thinks they have failed, even if they haven’t), simultaneous failover of hundreds or thousands of services can overload an infrastructure. As of this writing, Amazon has not yet identified a root cause for the outage, but early reports suggest this is likely to be a key exacerbating factor. I should point out that this spare capacity problem exists in other virtualized environments, and even in legacy server architectures.
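To see why mass failover behaves so differently from a single failure, consider the rough sketch below. The numbers are purely illustrative; the point is that recovery demand scales with the number of failed instances while spare capacity does not.

  # Rough sketch of why simultaneous failover overloads an infrastructure.
  # All numbers are hypothetical.
  failed_instances = 2000        # instances failing over at the same time
  spare_slots = 300              # spare hosts that can rebuild in parallel
  rebuild_minutes = 15           # time to restore one instance from backup

  # A single failure is absorbed immediately; a mass failure queues up
  # in waves, each wave limited by the spare capacity available.
  waves = -(-failed_instances // spare_slots)   # ceiling division
  worst_case_minutes = waves * rebuild_minutes

  print(f"Recovery waves needed: {waves}")
  print(f"Worst-case time until the last instance is restored: "
        f"{worst_case_minutes} minutes")
  # 2000 failures against 300 spare slots means 7 waves, or about 105
  # minutes for the last instance, assuming nothing else goes wrong.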

It is for this reason that disaster recovery plans usually specify processes and elapsed times to acquire replacement capacity, whether it be data center power and cooling, servers, storage rebuilds, software rekeying, etc. Short of reserving capacity and replicating data at independent locations, there is no guarantee that a given service will have spare capacity in the face of a major outage to a cluster, storage system, network, or data center, or that consumers will have connectivity. Some business-critical systems may require this level of redundancy, and while redundancy always carries an extra cost, cloud computing can provide this capability at lower cost by leveraging the buying power of the cloud provider.
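As one concrete example of replicating data at an independent location, the sketch below copies an EBS snapshot from the affected region to a second region so a rebuild does not depend on the region that failed. It assumes today’s boto3 SDK and a hypothetical snapshot ID; reserved capacity in the target region would still have to be arranged separately.

  # Copy an EBS snapshot to a second, independent region. Assumes boto3,
  # appropriate IAM permissions, and a hypothetical snapshot ID.
  import boto3

  SOURCE_REGION = "us-east-1"
  TARGET_REGION = "us-west-2"
  SNAPSHOT_ID = "snap-0123456789abcdef0"   # hypothetical snapshot ID

  # The copy request is issued against the *destination* region.
  ec2_target = boto3.client("ec2", region_name=TARGET_REGION)
  copy = ec2_target.copy_snapshot(
      SourceRegion=SOURCE_REGION,
      SourceSnapshotId=SNAPSHOT_ID,
      Description="DR copy of data warehouse volume",
  )
  print("Replica snapshot created:", copy["SnapshotId"])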

The Fallout

Although a high-impact outage will cause temporary pain, I think that ultimately cloud providers will respond by improving reliability and mitigating single points of failure. Cloud customers will respond by taking more responsibility for implementing application and database service levels and recovery processes that consider the cloud environment and leverage the capabilities of the cloud provider. One recent example of lessons learned, published by Stephen-Nelson Smith, entreats cloud users to design for downtime and to use the range of free and paid-for tools Amazon supports.

Here in our own business at Ajilitee, we weathered the storm pretty well, as our disaster recovery plan kicked into gear to minimize our exposure. Still, there were lessons here, too, and we are reviewing our business continuity plan to ensure even better availability. As both cloud providers and cloud customers learn from experience, they will get better at managing expectations, weighing risk vs. reward, and making decisions that make doing business in the cloud FUD-free.

Your Thoughts?

So what do you think will be the fallout of this cloud computing storm? Do you agree with me or see a more cautious customer in cloud’s future? Please add your comments to this article or send them to me by email.

  • John Bair

    John is Chief Technology Officer at Ajilitee, a consulting and services firm that specializes in business intelligence, information management, agile analytics, and cloud enablement. John’s technology career includes leadership positions at companies such as HP, Knightsbridge Solutions, and Amazon.com. He has decades of experience building complex information management systems and is an inventor on six data management patents.  

    Editor's Note: Find more articles and resources in John's BeyeNETWORK Expert Channel. Be sure to visit today!
