Free insights from AWS forensics

Understanding AWS outages provides great insight into cloud system architecture, systemic failure modes and some of the gotchas, in any architecture, that can bite.

Key to this is the transparent way in which Amazon disclose their forensic analysis:
Summary of the Amazon DynamoDB Service Disruption and Related Impacts in the US-East Region

Interestingly this DynamoDB issue is an example of a systemic failure in the control-plane (in this case a logical rather than physical one) reminiscent of the massive ec2/s3 failure they had a couple of years ago. Also, the simple architecture (polling rather than messaging or other async communication mechanism) shows Amazon’s approach of doing the simplest thing that will work independently and robustly — clearly necessary at the scale they operate at. Nevertheless, unexpected side-effects can still wreak havoc.