Studying AWS outages gives great insight into cloud system architecture, systemic failure modes and the gotchas that can bite in any architecture.

Key to this is the transparent way in which Amazon disclose their forensic analysis:
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region

Interestingly, this DynamoDB issue is an example of a systemic failure in the control plane (in this case a logical rather than a physical one), reminiscent of the massive EC2/S3 failure they had a couple of years ago. Also, the simple architecture (polling rather than messaging or some other asynchronous communication mechanism) shows Amazon’s approach of doing the simplest thing that will work independently and robustly, which is clearly necessary at the scale they operate at. Nevertheless, unexpected side effects can still wreak havoc.
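To make the polling point concrete, here is a minimal sketch of a polling client. It is not Amazon's implementation: `poll_with_backoff`, `fetch` and every parameter are hypothetical names chosen for illustration. It shows one common way to keep a simple poller from adding to the havoc: when a request times out, wait longer before each retry and add jitter so that many clients failing at the same moment do not all retry in lockstep.

```python
import random
import time


def poll_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fetch() until it succeeds, backing off after each timeout.

    fetch is a hypothetical callable that returns data or raises TimeoutError.
    Returns (result, delays) so the caller can see how long it waited.
    """
    delays = []
    for attempt in range(max_attempts):
        try:
            return fetch(), delays
        except TimeoutError:
            # Exponential backoff, capped at max_delay. The random factor
            # (jitter) spreads out clients that failed at the same time.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            delay = ceiling * rng()
            delays.append(delay)
            sleep(delay)
    raise TimeoutError(f"gave up after {max_attempts} attempts")
```

A caller would pass its own request function, for example `poll_with_backoff(lambda: get_config(timeout=2))`, where `get_config` is again a made-up name. The design choice is the trade-off the post describes: polling is simple and each client works independently, but the retry policy decides whether a slow service recovers or gets buried.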

“… the dynamics of the cloud IaaS market (which is increasingly also convergent with the high-control PaaS market) are fundamentally those of a software market. It is a market in which customers are ultimately buying IT operations management software as a service, and as they go up-stack, middleware as a service.”

Lydia Leong, Gartner

Rackspace has thrown in the towel on unmanaged cloud services. Feeling pressure from the big boys (Amazon, Google and Microsoft) and the much smaller but VC-funded VPS providers (DigitalOcean and Linode, to name but two), they have hunkered down on managed-only services for new accounts.

With the VPS providers steadily dropping the price of memory to $10/GB/month and Amazon also inexorably driving prices down, Rackspace has been squeezed in the middle and is clearly looking to differentiate around its ‘fanatical support’ mantra, starting at an additional $50/month for a $20 server. Whether this will work is an open question. Low-cost users are typically not looking for support, just a low price, and larger organizations using AWS et al. either do it themselves or use a cloud management provider such as RightScale.
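A quick back-of-the-envelope calculation shows how steep that support premium is for a low-end customer. The figures are the ones quoted above; the function name is just for illustration.

```python
def managed_cost(server_price, support_fee):
    """Return the total monthly price and its multiple of the bare server price."""
    total = server_price + support_fee
    return total, total / server_price


# $20/month server plus the $50/month managed support entry fee
total, multiple = managed_cost(20, 50)
print(f"${total}/month, {multiple:.1f}x the unmanaged price")
```

At these prices the managed server costs $70/month, or 3.5 times the bare server, which is why the pitch is unlikely to appeal to anyone shopping on price alone.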

Following on from the bold assertion that 20% of IT departments would have no need for physical assets by 2012, Gartner have now turned their attention to the future of the Personal Computer. According to their research, “the Personal Cloud Will Replace the Personal Computer as the Center of Users’ Digital Lives by 2014”.

There is plenty of evidence for this already: the adoption of Dropbox, Google Docs, iTunes, iCloud and a myriad of mobile-centered iPhone apps has demonstrated the huge demand for always-on, always-connected mobile services.

Consumers, who also happen to be bosses and employees, are increasingly tech-savvy and their expectations are moving beyond what corporate IT departments can meet with existing infrastructure and skill sets. Users’ expectations of rapid change have been met to some extent by infrastructure technologies such as virtualization, which has improved IT’s operational agility, but the growing availability of consumer-centric mobile apps has led to a disconnect between what’s available in the consumer space and what companies provision for their staff. The lack of application development skills, or of a general understanding of the wider ramifications of mobile, is becoming a visible weakness for many organizations.

Not surprisingly, employees take what’s available in the consumer world and bend it to their needs, often working around the compliance and security requirements of their employers and the command and control mindset of enterprise IT.

Writing in CIO magazine, Bernard Golden outlines some of the concepts that need to be understood when performing an OpEx versus CapEx calculation for IT infrastructure. For example, no-commitment OpEx (such as the classic Amazon AWS pricing model) should always cost more per service hour, given that there is a cost to a no-commitment relationship [1] that must be borne by the service provider. He uses the car-rental business as an example, which may not be the best analogy, but it makes the point.

Another question is that of utilization. Forrester analyst James Staten coined the term “Down and Off”, an idea somewhat analogous to switching off the lights in an empty room. Prior to the cloud, the argument goes, “Down and Off” was a) too hard to do, and b) not worth the effort, since there was little economic imperative to overcome the challenges of implementing it when the cost of computing was wrapped up in CapEx that had already been accounted for.

The difficulty in making use of Down and Off is what economists call “friction”, and one of the benefits of a highly automated cloud computing model is the elimination of barriers to reducing unwanted operational overhead.

As such costs change in response to technical innovation, Golden points out that

… input assumptions to financial analyses will change as IT organizations begin to re-evaluate application resource consumption models. Many application designs will move toward a continuous operation of a certain base level of resource, with additional resources added and subtracted in response to changing usage. The end result will be that the tipping point calculation is likely to shift toward an asset operation model rather than an asset ownership one.
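The tipping point Golden refers to can be sketched with a little arithmetic. This is a simplified model of my own, not his: it treats an owned server as a fixed monthly cost (purchase price spread over its lifetime plus running costs) and an on-demand server as an hourly rate paid only while it is switched on. All the numbers below are invented for illustration.

```python
HOURS_PER_MONTH = 730  # average number of hours in a calendar month


def owned_monthly_cost(capex, lifetime_months, running_cost_per_month):
    """Ownership: you pay the same whether the server is busy or idle."""
    return capex / lifetime_months + running_cost_per_month


def on_demand_monthly_cost(hourly_rate, utilization):
    """Operation: with "Down and Off" you pay only for the hours you use."""
    return hourly_rate * HOURS_PER_MONTH * utilization


def break_even_utilization(capex, lifetime_months, running_cost_per_month,
                           hourly_rate):
    """Fraction of the month above which owning becomes the cheaper option."""
    owned = owned_monthly_cost(capex, lifetime_months, running_cost_per_month)
    return owned / (hourly_rate * HOURS_PER_MONTH)


# Hypothetical inputs: a $3,600 server kept for 36 months, $50/month to run,
# versus an on-demand instance at $0.50/hour.
owned = owned_monthly_cost(3600, 36, 50)                    # 150.0 per month
tipping_point = break_even_utilization(3600, 36, 50, 0.50)  # about 0.41
print(f"owning wins above {tipping_point:.0%} utilization")
```

With these made-up inputs, a workload that runs less than about 41% of the time is cheaper on demand, even though the hourly rate carries the no-commitment premium. Lower the friction of switching things off and more workloads fall below that line, which is the shift toward asset operation that Golden describes.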


1. Both Amazon and IBM, amongst others, offer reduced hourly rates for customers that sign-up for a fixed-length commitment period.

Richard Fichera at Forrester reports a fascinating development in the server marketplace and its big new customer: the large-scale cloud computing environments that are now the biggest single purchasers of server hardware.

HP is creating a new “hyperscale business unit” to exploit the very low power ARM-based server designs being developed by Calxeda. According to Fichera, HP’s move is

“… based on the premise that very high-volume data centers will continue to proliferate, driven by massive continued increases in demand for web and cloud-based applications handling massive amounts of data, and that the trajectory of current systems technology with respect to power, cooling and density may be inadequate for emerging requirements.”

This all becomes particularly interesting given that Microsoft and NVIDIA demonstrated Windows 7 running on the NVIDIA Tegra (dual-core ARM @1.3 GHz) at CES 2011.

Ever since the original version of NT, Microsoft has ensured that Windows is portable across architectures and in the past has targeted MIPS, DEC Alpha and Itanium as well as the ubiquitous x86 and x86-64. The benefits of this commitment are now becoming apparent.

The recent Amazon outage has created some heated discussion as to whether Amazon’s services are enterprise-ready or not. Much of the discussion seems to miss the point. For example, saying that Amazon is not enterprise-class is like saying an IBM x-server is not enterprise-class. Not very helpful and not very meaningful.

Amazon is a provider of compute and storage, like the aforementioned server. Give that server RAID direct-attached-storage, or dual-homing to a SAN, power from two UPSs and a mirror image of itself in another data center and you can perform synchronization between the two. Lo and behold, enterprise-class computing!

This can all be achieved with Amazon using different ‘Availability Zones’ in more than one Region and the appropriate software. And of course there is an associated price.
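The price of that redundancy buys a measurable improvement, which a rough calculation can illustrate. The sketch below assumes that each copy fails independently of the others, and the 99% figure is purely illustrative, not an Amazon number. A failure in something shared, such as a control plane, breaks the independence assumption, so treat the result as a best case.

```python
HOURS_PER_YEAR = 8760


def combined_availability(single, replicas):
    """Availability of a service that is up whenever at least one replica is up.

    Assumes replicas fail independently (an optimistic assumption).
    """
    return 1 - (1 - single) ** replicas


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


one_copy = 0.99  # illustrative availability of a single deployment
two_copies = combined_availability(one_copy, 2)

print(f"one copy:   {downtime_hours_per_year(one_copy):.1f} hours down per year")
print(f"two copies: {downtime_hours_per_year(two_copies):.2f} hours down per year")
```

Under these assumptions a second copy takes expected downtime from roughly 88 hours a year to under one hour, at about twice the infrastructure cost plus the software needed to keep the two in sync.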

The reality is that the majority of Amazon’s clients are startups (many in the social networking space) that are willing to take the risk (or don’t comprehend it) in return for scalability, agility and above all the right price. Another significant group of clients are enterprises in search of cheap, agile compute for problems requiring mass horizontal scalability, but not persistence.

The really fascinating question behind this outage is the economic one, i.e. what ratio of risk to cost companies are willing to tolerate for Information Technology.

Countless small enterprises that make heavy use of IT don’t have diesel backup and rely on their electrical utility to provide adequate uptime… sans SLA I might add. This is exactly the calculation that anyone using Amazon and its ilk is making, whether they are aware of it or not.

The cloud is all about economics, as are public electrical utilities, and we are in an important phase in the ongoing maturation of Information Technology: a field whose economics have long been cloudy (pun intended), to say the least.

Randy Bias at CloudScaling has put together some interesting metrics on Amazon’s release cycle for the EC2 platform. The implication is that Amazon is growing its investment in EC2 at a significant rate to ensure it further enhances its already significant leadership position.

Based on prior years’ activity, Randy estimates 66 feature releases this year, or more than one per week.

[Figure: EC2 feature release rate]

He has also published the source data in a Google Doc and it makes fascinating reading.

Clearly Amazon want to stay on the crest of the cloud computing wave and they recognize that providing superior functionality is going to be critical to defend against a growing array of competitors, all of whom are presently tiny in comparison but still have the potential to compete on feature set, particularly in the enterprise IT and public/private cloud space.