Fixing the Root Cause of Amazon’s S3 Outage
Posted by Bob Warfield on July 27, 2008
Details are here on how Amazon is fixing the root cause of the recent multi-hour S3 outage. The long and short of it is that single-bit corruption in the messages that describe server health spread widely and forced a restart of the whole system. Diagnosing the problem took time, and the restart itself was fairly slow. Amazon is attacking all of these angles: repairing the source of the corruption, both for this particular issue and for other areas vulnerable to the same problem, and taking steps to make diagnosis and restart faster.
These are all good moves that will make the system more robust well beyond just fixing the original bug. That’s the right way to think about infrastructure: fix the entire class of problems, not just the specific occurrence.
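To make the failure mode concrete, here is a minimal Python sketch of one common way to keep a single flipped bit in a server-health message from spreading: put a checksum on every message and have the receiver drop anything that fails verification. This is an illustration only, not a description of Amazon's fix. The message layout, the field names, and the helpers `encode_state`, `decode_state`, and `flip_bit` are all hypothetical.

```python
import json
import zlib


def encode_state(state: dict) -> bytes:
    """Serialize a (hypothetical) health message and prepend a CRC32 checksum."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    checksum = zlib.crc32(payload)
    return checksum.to_bytes(4, "big") + payload


def decode_state(frame: bytes):
    """Return the decoded message, or None if the checksum does not match."""
    if len(frame) < 4:
        return None
    expected = int.from_bytes(frame[:4], "big")
    payload = frame[4:]
    if zlib.crc32(payload) != expected:
        # Corrupted in transit or in memory: discard instead of forwarding it.
        return None
    return json.loads(payload.decode("utf-8"))


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate single-bit corruption by flipping one bit of the frame."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    frame = encode_state({"server": "node-17", "healthy": True})
    print(decode_state(frame))               # intact message is accepted
    print(decode_state(flip_bit(frame, 40)))  # corrupted message is rejected
```

The point of the design is the one made above: a receiver that validates every message protects against the whole class of corruption bugs, not just the one that happened to cause the outage.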