SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

Fixing the Root Cause of Amazon’s S3 Outage

Posted by Bob Warfield on July 27, 2008

Details are here on how Amazon is fixing the root cause of the recent multi-hour S3 outage. The long and short of it is that single-bit corruption in the messages that describe the health of a server spread widely and forced a restart of the whole system. Diagnosing the problem slowed them down, and the restart itself was fairly slow. Amazon is attacking all of these angles: repairing the source of the corruption, both for this particular issue and for other areas vulnerable to the same problem, and taking steps to make diagnosis and restart faster.
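
To make the failure mode concrete, here is a minimal sketch of one common guard against this class of problem: attach a checksum to each server-health message and drop any message that fails verification instead of passing it along. This is an illustration, not Amazon's actual fix. The message layout and the names encode_state, decode_state and flip_bit are hypothetical.

```python
import json
import zlib
from typing import Optional


def encode_state(state: dict) -> bytes:
    """Serialize a health message and prefix it with a CRC32 checksum.

    Hypothetical wire format: 4-byte big-endian CRC32, then a JSON payload.
    """
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    checksum = zlib.crc32(payload)
    return checksum.to_bytes(4, "big") + payload


def decode_state(message: bytes) -> Optional[dict]:
    """Return the decoded state, or None if the message fails verification."""
    if len(message) < 4:
        return None
    expected = int.from_bytes(message[:4], "big")
    payload = message[4:]
    if zlib.crc32(payload) != expected:
        # Corrupted in transit or in memory: reject it rather than spread it.
        return None
    return json.loads(payload)


def flip_bit(message: bytes, bit_index: int) -> bytes:
    """Simulate single-bit corruption by flipping one bit of the message."""
    corrupted = bytearray(message)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    original = encode_state({"server": "node-17", "healthy": True})
    damaged = flip_bit(original, 40)
    print(decode_state(original))  # the intact message decodes normally
    print(decode_state(damaged))   # the corrupted message is rejected
```

A receiver that discards unverifiable messages keeps one bad bit from being forwarded to its peers, which is the kind of spread described above.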

These are all good moves that will increase the robustness of the system quite a ways beyond just fixing the original bug. That’s the right way to think about infrastructure: you need to fix the entire class of problems, not just the specific occurrence.
