Bug or Architecture Flaw? (Fail or No Fail)
Posted by Bob Warfield on April 23, 2008
Blaine, I feel for you. People expect and are pretty tolerant of a few bugs, but the problems at Twitter have gone on long enough that it's clear there are deep-seated architectural flaws that are not going away anytime soon. Twitter is taking the right steps–they've got a new VP of Engineering and Ops as well as two new scaling experts. Cook, rightly or wrongly, is firmly under the bus.
What follows next is important. The new gang has a limited window in which to fix the problem. This won’t be easy. Fixing deep architecture issues on a live system that can’t keep up is one of those nightmare scenarios that’s painful beyond belief.
What can we learn from this?
First, Twitter is just the latest example of an important service that has all the ingredients for success except for the ability to scale properly.
I gave up for the last time on Technorati not long ago for similar reasons. For a long time it was my blogging hub: I used it for search and to monitor how well my own messages were penetrating the blogosphere. But it was wildly inconsistent. It was easy to switch to Google Blog Search; after all, they're the search experts. But that Technorati Authority number seemed like it was worth hanging around for.
And then one day my Authority dropped almost 100 points. In one day I went from over 300 to just a little over 200. What’s up with this? I waited for it to come back–at various points in the past, something similar had happened and then corrected itself in a day or two. No such luck.
Eventually I stayed away long enough that it was time to log in again. I didn't remember my account info, so I simply searched for SmoothSpan to find my blog. There was my answer for what had happened: I saw two SmoothSpans! One had my old 300-plus Authority; the other had the new, lower number.
But I could tell neither was really right. In other words, the true authority was some mix of sites from one and some from the other. Thinking about how this could happen revealed a classic architecture flaw: Technorati had created more than one record for the same thing and couldn’t keep them straight.
Twitter’s situation is similar. Supposedly they were rolling out a new caching system of some kind when their latest troubles hit. Caches create more than one copy of the data intentionally, to make it easier to scale. The trick is to keep it all running smoothly and to feed the cache from the one true version of the data.
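The discipline the paragraph above describes–keeping every cached copy fed from the one true version of the data–can be sketched as a tiny write-through cache. This is an illustrative example, not Twitter's or Technorati's actual design; the class and method names are my own invention:

```python
class WriteThroughCache:
    """Minimal write-through cache: the backing store stays the single
    source of truth, and every write hits the store first, then the
    cached copy, so the two cannot silently drift apart."""

    def __init__(self, store):
        self.store = store   # the "one true version" of the data
        self.cache = {}      # the extra copy, kept only to speed up reads

    def get(self, key):
        if key not in self.cache:
            # Cache miss: fill from the source of truth.
            self.cache[key] = self.store[key]
        return self.cache[key]

    def set(self, key, value):
        self.store[key] = value   # write the truth first...
        self.cache[key] = value   # ...then refresh the copy
```

The flaw Technorati apparently hit is what happens when this discipline breaks: two records for the same thing, each updated independently, with no single store deciding which is right.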
Second learning point: It may be a bad idea to worry about scaling later. This topic has been debated from time to time. Some have advanced the notion that to worry about scaling up front is a premature optimization. Scaling is not a premature optimization! It is fundamental architecture. The number of developers who can deliver a highly scalable web property is a tiny fraction of the number of developers who can get “almost there”. The difference between having one architecture versus the other is a healthy heaping of FAIL.
My last company, Callidus Software, understood scaling. We got it so well it became a major differentiator for the product. We could literally go to customers where the likes of Oracle and SAP (who supposedly understood scaling) could not go because their systems wouldn’t handle the scale. There’s nothing quite like being the only game in town for a big customer that has to have a solution.
Scaling is something the Cloud Platform world may eventually deliver for us. So far, they are more about Utility Computing, which is the ability to add more machines quickly and easily. That's Amazon's model. Whether your software can use more machines (i.e. whether it scales) is up to you. Teasing apart the aspects of an application that make it scalable and handing them over to a platform will be a ticklish business. It's likely to require a good deal of rewriting. But either way, your team can get it right the first time, do a rewrite under fire, or rewrite for a Cloud Platform that shows how it's done.
Does your team understand scaling? Really? How do you know?
A recent interview with Blaine Cook. Interesting note on eventual consistency: Twitter only allows API clients to update once per minute. The team is 5 full-time developers, and that includes 1 new person. Tight team, but that's good.
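A once-per-minute update cap like the one mentioned above is usually enforced with a simple per-client rate limiter. Here's a minimal sketch of the general technique–the class name, the injectable clock, and the exact policy are my assumptions, not a description of Twitter's implementation:

```python
import time

class UpdateRateLimiter:
    """Allow each client at most one update per `interval` seconds."""

    def __init__(self, interval=60.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock        # injectable so tests can fake time
        self.last_update = {}     # client id -> time of last allowed update

    def allow(self, client_id):
        """Return True and record the update if the client is under
        its limit; return False to reject the update."""
        now = self.clock()
        last = self.last_update.get(client_id)
        if last is None or now - last >= self.interval:
            self.last_update[client_id] = now
            return True
        return False
```

A limiter like this is cheap insurance for a small team: it bounds write traffic per client no matter how enthusiastic the API users get.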
Michael Arrington mentions he read this post in a Seesmic video. Scan down through the comments to see it. He's involved in a big slugfest over whether this was character assassination on Cook, whether the fault lies with Ruby on Rails, and so on and so forth. It's important not to lose track, in all of that emotional content, of the main issue here: scaling matters, a relatively small set of developers have lived through it and know what to do, and it is hard to fix after the fact.
Despite the fact that there’s basically a flamewar on Techcrunch over this, others seem to reach a similar conclusion. Larry Dignan has a good post over at ZDNet.