Amazon’s new SimpleDB offering, like many other post-modern databases such as CouchDB, offers massive scaling potential if users will accept eventual consistency. It feels like a weighty decision. Cast in the worst possible light, eventual consistency means the database will sometimes return the wrong answer in the interests of allowing it to keep scaling. Gasp! What good is a database that returns the wrong answer? Why bother?
Often waiting for the write answer (sorry, that inadvertant slip makes for a good pun so I’ll leave it in place) returns a different kind of wrong answer. Specifically, it may not return an answer at all. The system may simply appear to hang.
How does all this come about? Largely, it’s a function of how fast changes in the database can be propogated to the point they’re available to everyone reading from the database. For small numbers of users (i.e. we’re not scaling at all), this is easy. There is one copy of the data sitting in a table structure, we lock up the readers so they can’t access it whenever we change that data, and everyone always gets the right answer. Of course, solving simple problems is always easy. It’s solving the hard problems that lands us the big bucks. So how do we scale that out? When we reach a point where we are delivering that information from that one single place as fast as it can be delivered, we have no choice but to make more places to deliver from. There are many different mechanisms for replicating the data and making it all look like one big happy (but sometimes inconsistent) database, let’s look at them.
Once again, this problem may be simpler when cast in a certain way. The most common and easiest approach is to keep one single structure as the source of truth for writing, and then replicate out changes to many other databases for reading. All the common database software supports this. If your single database could handle 100 users consistently, you can imagine if those 100 users were each another database you were replication to, suddenly you could handle 100 * 100 users, or 10,000 users. Now we’re scaling. There are schemes to replicate the replicated and so on and so forth. Note that in this scenario, all writing must still be done on the one single database. This is okay, because for many problems, perhaps even the majority, readers far outnumber writers. In fact, this works so well, that we may not even use databases for the replication. Instead, we might consider a vast in-memory cache. Software such as memcached does this for us quite nicely, with another order of magnitude performance boost since reading things in memory is dramatically faster than trying to read from disk.
Okay, that’s pretty cool, but is it consistent? This will depend on how fast you can replicate the data. If you can get every database and cache in the system up to date between consecutive read requests, you are sure to be consistent. In fact, it just has to get done between read requests for any piece of data that changed, which is a much lower bar to hurdle. If consistency is critical, the system may be designed to inhibit reading until changes have propogated. It take some very clever algorithms to do this well without throwing a spanner into the works and bringing the system to its knees performance-wise.
Still, we can get pretty far. Suppose your database can service 100 users with reads and writes and keep it all consistent with appropriate performance. Let’s say we replace those 100 users with 100 copies of your database to get up to 10,000 users. It’s now going to take twice as long. During the first half, we’re copying changes from the Mother Server to all of the children. The second half we’re serving the answers to the readers requesting them. Let’s say we can keep the overall time the same just by halving how many are served. So the Mother Server talks to 50 children. Now we can scale to 50 * 50 = 2500 users. Not nearly as good, but still much better than not scaling at all. We can go 3 layers deep and have Mother serve 33 children each serve 33 grand children to get to 33 * 33 * 33 = 35,937 users. Not bad, but Google’s founders can still sleep soundly at night. The reality is we probably can handle a lot more than 100 on our Mother Server. Perhaps she’s good for 1000. Now the 3-layered scheme will get us all the way to 333*333*333 = 36 million. That starts to wake up the sound sleepers, or perhaps makes them restless. Yet, that also means we’re using over 100,000 servers too: 1 Mothers talks to 333 children who each have 333 grandchildren. It’s a pretty wasteful scheme.
Well, let’s bring in Eventual Consistency to reduce the waste. Assume you are a startup CEO. You are having a great day, because you are reading the wonderful review of your service in Techcrunch. It seems like the IPO will be just around the corner after all that gushing does it’s inevitable work and millions suddenly find their way to your site. Just at the peak of your bliss, the CTO walks in and says she has good news and bad news. The bad news is the site is crashing and angry emails are pouring in. The other bad news is that to fix it “right”, so that the data stays consistent, she needs your immediate approval to purchase 999 servers so she can set up a replicated scheme that runs 1 Mother Server (which you already own) and 999 children. No way, you say. What’s the good news? With a sly smile, she tells you that if you’re willing to tolerate a little eventual consistency, your site could get by on a lot fewer servers than 999.
Suppose you are willing to have it take twice as long as normal for data to be up to date. The readers will read just as fast, it’s just that if they’re reading something that changed, it won’t be correct until the second consecutive read or page refresh. So, our old model that had the system able to handle 1,000 users, and replicated to 999 servers to handle 1 million users used to have to go to 3 tiers (333 * 333 * 333) to get to the next level at 36 million and still serve everything consistently and just as fast. If we relax the “just as fast”, we can let our Mother Server handle 2,000 at half the speed to get to 2000 * 1000 = 2 million users on 3 tiers with 2000 servers instead of 100,000 servers to get to 36 million. If we run 4x slower on writes, we can get 4000*1000 = 4 million users with 4000 servers. Eventually things will bog down and thrash, but you can see how tolerating Eventual Consistency can radically reduce your machine requirements in this simple architecture. BTW, we all run into Eventual Consistency all the time on the web, whether or not we know it. I use Google Reader to read blogs and WordPress to write this blog. Any time a page refresh shows you a different result when you didn’t change anything, you may be looking at Eventual Consistency. Even if you suspect others changed something, Google Reader still comes along frequently and says an error occured and asks me to refresh. It’s telling me they relied on Eventual Consistency and I have an inconsistent result.
As I mention, these approaches can still be wasteful of servers because of all the data copies that are flowing around. This leads us to wonder, “What’s the next alternative?” Instead of just using servers to copy data to other servers, which is a prime source of the waste, we could try to employ what’s called a sharded or Federated architecture. In this approach, there is only one copy of each piece of data, but we’re dividing up that data so that each server is only responsible for a small subset of it. Let’s say we have a database keeping up with our inventory for a big shopping site. It’s really important to have it be consistent so that when people buy, they know the item was in stock. Hey, it’s a contrived example and we know we can cheat on it, but go with it. Let’s further suppose we have 100,000 SKU’s, or different kinds of items in our inventory. We can divide this across 100 servers by letting each server be responsible for 1,000 items. Then we write some code that acts as the go-between with the servers. It simply checks the query to see what you are looking for, and sends your query to the correct sub-server. Voila, you have a sharded architecture that scales very efficiently. Our replicated model would blow out 99 copies from the 1 server, and it could be about 50 times faster (or handle 50x the users as I use a gross 1/2 time estimate for replication time) on reads, but it was no faster at all on writes. That wouldn’t work for our inventory problem because writes are so common during the Christmas shopping season.
Now what are the pitfalls of sharding. First, there is some assembly required. Actually, there is a lot of assembly required. It’s complicated to build such architectures. Second, it may be very hard to load balance the shards. Just dividing up the product inventory across 100 servers is not necessarily helpful. You would want to use a knowledge of access patterns to divide the products so the load on each server is about the same. If all the popular products wound up on one server, you’d have a scaling disaster. These balances can change over time and have to be updated, which brings more complexity. Some say you never stop fiddling with the tuning of a sharded architecture, but at least we don’t have Eventual Consistency. Hmmm, or do we? If you can ever get into a situation where there is more than one copy of the data and the one you are accessing is not up to date, Eventual Consistency could rear up as a design choice made by the DB owners. In that case, they just give you the wrong answer and move on.
How can this happen in the sharded world? It’s all about that load balancing. Suppose our load balancer needs to move some data to a different shard. Suppose the startup just bought 10 more servers and wants to create 10 additional shards. While that data is in motion, there are still users on the site. What do we tell them? Sometimes companies can shut down the service to keep everything consistent while changes are made. Certainly that is one answer, but it may annoy your users greatly. Another answer is to tolerate Eventual Consistency while things are in motion with a promise of a return to full consistency when the shards are done rebalancing. Here is a case where the Eventual Consistency didn’t last all that long, so maybe that’s better than the case where it happens a lot.
Note that consistency is often in the eye of the beholder. If we’re talking Internet users, ask yourself how much harm there would be if a page refresh delivered a different result. In may applications, the user may even expect or welcome a different result. An email program that suddenly shows mail after a refresh is not at all unexpected. That the user didn’t know the mail was already on the server at the time of the first refresh doesn’t really hurt them. There are cases where absolute consistency is very important. Go back to the sharded database example. It is normal to expect every single product in the inventory to have a unique id that lets us find that part. Those ids have to be unique and consistent across all of the shards. It is crucially important that any id changes are up to date before anything else is done or the system can get really corrupted. So, we may create a mechanism to generate consistent ids across shards. This adds still more architectural complexity.
There are nightmare scenarios where it becomes impossible to shard efficiently. I will over simplify to make it easy and not necessarily correct, but I hope you will get the idea. Suppose you’re dealing with operations that affect many different objects. The objects are divided into shards naturally when examined individually, but the operations between the objects span many shards. Perhaps the relationships between shards are incompatible to the extent that there is no way to shard them across machines such that every single operation doesn’t hit many shards instead of a single shard. Hitting many shards will invalidate the sharding approach. In times like this, we will again be tempted to opt for Eventual Consistency. We’ll get to hitting all the shards in our sweet time, and any accesses before that update is finished will just live with inconsistent results. Such scenarios can arise where there is no obvious good sharding algorithm, or where the relationships between the objects (perhaps its some sort of real time collaborative application where people are bouncing around touching objects unpredictably) are changing much too quickly to rebalance the shards. One really common case of an operation hitting many shards is queries. You can’t anticipate all queries such that any of them can be processed within a single shard unless you sharply limit the expressiveness of the query tools and languages.
I hope you come away from this discussion with some new insights:
– Inconsistency derives from having multiple copies of the data that are not all in sync.
– We need multiple copies to scale. This is easiest for reads. Scaling writes is much harder.
– We can keep copies consistent at the expense of slowing everything down to wait for consistency. The savings in relaxing this can be quite large.
– We can somewhat balance that expense with increasingly complex architecture. Sharding is more efficient than replication, but gets very complex and can still break down, for example.
– It’s still cheaper to allow for Eventual Consistency, and in many applications, the user experience is just as good.
Big web sites realized all this long ago. That’s why sites like Amazon have systems like SimpleDB and Dynamo that are built from the ground up with Eventual Consistency in mind. You need to look very carefully at your application to know what’s good or bad, and also understand what the performance envelope is for the Eventual Consistency. Here are some thoughts from the blogosphere:
The documentation for the PutAttributes method has the following note
Because Amazon SimpleDB makes multiple copies of your data and uses an eventual consistency update model, an immediate
Query request (read) immediately after a
PutAttributes request (write) might not return the updated data.
This may or may not be a problem depending on your application. It may be OK for a del.icio.us style application if it took a few minutes before your tag updates were applied to a bookmark but the same can’t be said for an application like Twitter. What would be useful for developers would be if Amazon gave some more information around the delayed propagation such as average latency during peak and off-peak hours.
Here I think Dare’s example of Twitter suffering from Eventual Consistency is interesting. In Twitter, we follow mico-blog postings. What would be the impact of Eventual Consistency? Of course it depends on the exact nature of the consistency, but lets look at our replicated reader approach. Recall that in the Eventual Consistency version, we simply tolerate that we allow reads to come in so fast that some of the replicated read servers are not up to date. However, they are up to date with respect to a certain point in time, just not necessarily the present. In other words, I could read at 10:00 am and get results on one server that are up to date through 10:00 am and on another results only up to date through 9:59 am. For Twitter, depending on which server my session is connected to, my feeds may update a little behind the times. Is that the end of the world? For Twitter users, if they are engaged in a real time conversation, it means the person with the delayed feed may write something that looks out of sequence to the person with the up to date feed whenever the two are in a back and forth chat. OTOH, if Twitter degraded to that mode rather than taking longer and longer to accept input or do updates, wouldn’t that be better?
Onnen wrote a post called “Socializing Eventual Consistency” that has two important points. First, many developers are not used to talking about Eventual Consistency. The knee jerk reaction is that it’s bad, not the right thing, or an unnecessary compromise for anyone but a huge player like Amazon. It’s almost like a macho thing. Onnen lacked the right examples and vocabulary to engage his peers when it was time to decide about it. Hopefully all the chatter about Amazon’s SimpleDB and other massively scalable sites will get more familiarity flowing around these concepts. I hope this article also makes it easier.
His other point is that when push comes to shove, most business users will prefer availability over consistency. I think that is a key point. It’s also a big takeaway from the next blog:
Amazon’s CTO posted to try to make Eventual Consistency and it’s trade offs more clear for all. He lays a lot of good theoretical groundwork that boils down to explaining that there are tradeoffs and you can’t have it all. This is similar to the message I’ve tried to portray above. Eventually, you have to keep multiple copies of the data to scale. Once that happens, it becomes harder and harder to maintain consistency and still scale. Vogels provides a full taxonomy of concepts (i.e. Monotonic Write Consistency et al) with which to think about all this and evaluate the trade offs. He also does a good job pointing out how often even conventional RDMS’s wind up dealing with inconsistency. Some of the best (and least obvious to many) examples include the idea that your mechanism for backups is often not fully consistent. The right answer for many systems is to require that writes always work, but that reads are only eventually consistent.
I’ve covered a lot of consistency related tradeoffs involved in database systems for large web architectures. Rest assured, that unless you are pretty unsuccessful, you will have to deal with this stuff. Get ahead of the curve and understand for your application what the consistency requirements will be. Do not start out being unnecessarily consistent. That’s a premature optimization that can bite you in many ways. Relaxing consistency as much as possible while still delivering a good user experience can lead to radically better scaling as well as making your life simpler. Eventual Consistency is nothing to be afraid of. Rather, it’s a key concept and tactic to be aware of.
Personally, I would seriously look into solutions like Amazon’s Simple DB while I was at it.