SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

Fixing the Root Cause of Amazon’s S3 Outage

Posted by Bob Warfield on July 27, 2008

Details are here on how Amazon is fixing the root cause of the recent multi-hour S3 outage. The long and short of it is that single-bit corruption in the messages that describe the health of a server spread widely and forced a restart of the whole system. Diagnosing the problem took time, and the restart itself was fairly slow. Amazon is attacking all of these angles: it is repairing the source of the corruption, both for this particular issue and for other areas vulnerable to the same problem, and it is taking steps to make diagnosis and restart faster.

These are all good moves that will make the system considerably more robust than fixing the original bug alone would. That’s the right way to think about infrastructure: you need to fix the entire class of problems, not just the specific occurrence.
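
To make the failure mode concrete, here is a minimal sketch of one common guard against single-bit corruption: attach a checksum to each health message and drop any message that fails verification instead of passing it along. This is an illustration only, not a description of Amazon's actual fix. The message format (JSON plus a CRC32 trailer) and the names pack, unpack, and flip_bit are all assumptions made up for the example.

```python
import json
import zlib
from typing import Optional


def pack(state: dict) -> bytes:
    """Serialize a health-state message and append a CRC32 of the payload."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unpack(message: bytes) -> Optional[dict]:
    """Return the decoded state, or None if the checksum does not match."""
    if len(message) < 4:
        return None
    payload, trailer = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None  # corrupted somewhere along the way: drop it, do not forward
    return json.loads(payload)


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Simulate single-bit corruption by flipping one bit of the message."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    msg = pack({"server": "node-1", "healthy": True})
    print(unpack(msg))               # the original state comes back intact
    print(unpack(flip_bit(msg, 5)))  # a flipped bit is caught and rejected
```

The point of the sketch is the second half of the post: a check like this is cheap to apply to every kind of message that travels between servers, which is what fixing the class of problem looks like, as opposed to patching the one message type that happened to break.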

