SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

Archive for the 'amazon' Category


Eventual Consistency Is Not That Scary

Posted by smoothspan on December 22, 2007

Amazon’s new SimpleDB offering, like many other post-modern databases such as CouchDB, offers massive scaling potential if users will accept eventual consistency.  It feels like a weighty decision.  Cast in the worst possible light, eventual consistency means the database will sometimes return the wrong answer in the interests of allowing it to keep scaling.  Gasp!  What good is a database that returns the wrong answer?  Why bother? 

Often waiting for the write answer (sorry, that inadvertent slip makes for a good pun so I’ll leave it in place) returns a different kind of wrong answer.  Specifically, it may not return an answer at all.  The system may simply appear to hang.

How does all this come about?  Largely, it’s a function of how fast changes in the database can be propagated to the point they’re available to everyone reading from the database.  For small numbers of users (i.e. we’re not scaling at all), this is easy.  There is one copy of the data sitting in a table structure, we lock up the readers so they can’t access it whenever we change that data, and everyone always gets the right answer.  Of course, solving simple problems is always easy.  It’s solving the hard problems that lands us the big bucks.  So how do we scale that out?  When we reach a point where we are delivering that information from that one single place as fast as it can be delivered, we have no choice but to make more places to deliver from.  There are many different mechanisms for replicating the data and making it all look like one big happy (but sometimes inconsistent) database.  Let’s look at them.

Once again, this problem may be simpler when cast in a certain way.  The most common and easiest approach is to keep one single structure as the source of truth for writing, and then replicate out changes to many other databases for reading.  All the common database software supports this.  If your single database could handle 100 users consistently, you can imagine if those 100 users were each another database you were replicating to, suddenly you could handle 100 * 100 users, or 10,000 users.  Now we’re scaling.  There are schemes to replicate the replicated and so on and so forth.  Note that in this scenario, all writing must still be done on the one single database.  This is okay, because for many problems, perhaps even the majority, readers far outnumber writers.  In fact, this works so well that we may not even use databases for the replication.  Instead, we might consider a vast in-memory cache.  Software such as memcached does this for us quite nicely, with another order of magnitude performance boost since reading things in memory is dramatically faster than trying to read from disk.

Okay, that’s pretty cool, but is it consistent?  This will depend on how fast you can replicate the data.  If you can get every database and cache in the system up to date between consecutive read requests, you are sure to be consistent.  In fact, it just has to get done between read requests for any piece of data that changed, which is a much lower bar to hurdle.  If consistency is critical, the system may be designed to inhibit reading until changes have propagated.  It takes some very clever algorithms to do this well without throwing a spanner into the works and bringing the system to its knees performance-wise.
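To make the timing issue concrete, here is a minimal sketch of a write primary with read replicas where replication is a separate step.  The class and key names (Primary, Replica, "stock:widget") are invented for illustration and do not correspond to any particular database product.

```python
from collections import deque


class Replica:
    """Read-only copy that only changes when the primary pushes to it."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Primary:
    """Single source of truth for writes; replication is deferred."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = deque()  # changes not yet copied to the replicas

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Push every pending change to every replica, in write order."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


replicas = [Replica(), Replica()]
primary = Primary(replicas)

primary.write("stock:widget", 5)
primary.replicate()
primary.write("stock:widget", 4)          # a sale happens

stale = replicas[0].read("stock:widget")  # replication has not run yet
primary.replicate()
fresh = replicas[0].read("stock:widget")  # replica has caught up
print(stale, fresh)
```

A read that lands between the write and the next replicate() call sees the old value.  Blocking reads until replicate() finishes would restore consistency at the cost of making readers wait.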

Still, we can get pretty far.  Suppose your database can service 100 users with reads and writes and keep it all consistent with appropriate performance.  Let’s say we replace those 100 users with 100 copies of your database to get up to 10,000 users.  It’s now going to take twice as long.  During the first half, we’re copying changes from the Mother Server to all of the children.  The second half we’re serving the answers to the readers requesting them.  Let’s say we can keep the overall time the same just by halving how many are served.  So the Mother Server talks to 50 children.  Now we can scale to 50 * 50 = 2500 users.  Not nearly as good, but still much better than not scaling at all.  We can go 3 layers deep and have Mother serve 33 children, each serving 33 grandchildren, to get to 33 * 33 * 33 = 35,937 users.  Not bad, but Google’s founders can still sleep soundly at night.  The reality is we probably can handle a lot more than 100 on our Mother Server.  Perhaps she’s good for 1000.  Now the 3-layered scheme will get us all the way to 333 * 333 * 333, or about 36.9 million.  That starts to wake up the sound sleepers, or perhaps makes them restless.  Yet, that also means we’re using over 100,000 servers too: 1 Mother talks to 333 children who each have 333 grandchildren.  It’s a pretty wasteful scheme.
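The arithmetic above can be checked with a few lines of code.  This is only a sketch of the back-of-the-envelope model in the paragraph: the function name fanout_capacity is made up, and it assumes every server in every tier feeds the same number of clients.

```python
def fanout_capacity(per_server, layers):
    """Users served and servers needed for a replication tree.

    per_server: how many clients (replicas or users) one server can feed.
    layers: number of server tiers, counting the Mother Server as tier 1.
    """
    users = per_server ** layers
    servers = sum(per_server ** tier for tier in range(layers))
    return users, servers


for per_server, layers in [(50, 2), (33, 3), (333, 3)]:
    users, servers = fanout_capacity(per_server, layers)
    print(f"{per_server} per server, {layers} tiers: "
          f"{users:,} users on {servers:,} servers")
```

For the 333-wide, 3-tier case the server count comes out to 1 + 333 + 333 * 333, which is where the "over 100,000 servers" figure comes from.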

Well, let’s bring in Eventual Consistency to reduce the waste.  Assume you are a startup CEO.  You are having a great day, because you are reading the wonderful review of your service in TechCrunch.  It seems like the IPO will be just around the corner after all that gushing does its inevitable work and millions suddenly find their way to your site.  Just at the peak of your bliss, the CTO walks in and says she has good news and bad news.  The bad news is the site is crashing and angry emails are pouring in.  The other bad news is that to fix it “right”, so that the data stays consistent, she needs your immediate approval to purchase 999 servers so she can set up a replicated scheme that runs 1 Mother Server (which you already own) and 999 children.  No way, you say.  What’s the good news?  With a sly smile, she tells you that if you’re willing to tolerate a little eventual consistency, your site could get by on a lot fewer servers than 999.

Suppose you are willing to have it take twice as long as normal for data to be up to date.  The readers will read just as fast, it’s just that if they’re reading something that changed, it won’t be correct until the second consecutive read or page refresh.  So, our old model that had the system able to handle 1,000 users, and replicated to 999 servers to handle 1 million users used to have to go to 3 tiers (333 * 333 * 333) to get to the next level at 36 million and still serve everything consistently and just as fast.  If we relax the “just as fast”, we can let our Mother Server handle 2,000 at half the speed to get to 2000 * 1000 = 2 million users on 3 tiers with 2000 servers instead of 100,000 servers to get to 36 million. If we run 4x slower on writes, we can get 4000*1000 = 4 million users with 4000 servers.  Eventually things will bog down and thrash, but you can see how tolerating Eventual Consistency can radically reduce your machine requirements in this simple architecture.  BTW, we all run into Eventual Consistency all the time on the web, whether or not we know it.  I use Google Reader to read blogs and WordPress to write this blog.  Any time a page refresh shows you a different result when you didn’t change anything, you may be looking at Eventual Consistency.  Even if you suspect others changed something, Google Reader still comes along frequently and says an error occurred and asks me to refresh.  It’s telling me they relied on Eventual Consistency and I have an inconsistent result.

As I mentioned, these approaches can still be wasteful of servers because of all the data copies that are flowing around.  This leads us to wonder, “What’s the next alternative?”  Instead of just using servers to copy data to other servers, which is a prime source of the waste, we could try to employ what’s called a sharded or Federated architecture.  In this approach, there is only one copy of each piece of data, but we’re dividing up that data so that each server is only responsible for a small subset of it.  Let’s say we have a database keeping up with our inventory for a big shopping site.  It’s really important to have it be consistent so that when people buy, they know the item was in stock.  Hey, it’s a contrived example and we know we can cheat on it, but go with it.  Let’s further suppose we have 100,000 SKUs, or different kinds of items in our inventory.  We can divide this across 100 servers by letting each server be responsible for 1,000 items.  Then we write some code that acts as the go-between with the servers.  It simply checks the query to see what you are looking for, and sends your query to the correct sub-server.  Voila, you have a sharded architecture that scales very efficiently.  Our replicated model would blow out 99 copies from the 1 server, and it could be about 50 times faster (or handle 50x the users as I use a gross 1/2 time estimate for replication time) on reads, but it was no faster at all on writes.  That wouldn’t work for our inventory problem because writes are so common during the Christmas shopping season.
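The "go-between" code can be very small.  The sketch below assumes SKUs are numbered 0 through 99,999 and are assigned to shards in contiguous blocks of 1,000; the post does not say how SKUs are identified, so that numbering and the function names are assumptions.  Each shard is a plain dictionary standing in for a separate database server.

```python
NUM_SHARDS = 100
SKUS_PER_SHARD = 1_000

# Each shard is a plain dict standing in for a separate database server.
shards = [{} for _ in range(NUM_SHARDS)]


def shard_for(sku):
    """Map a SKU number (0..99,999) to the one shard that owns it."""
    if not 0 <= sku < NUM_SHARDS * SKUS_PER_SHARD:
        raise ValueError(f"unknown SKU {sku}")
    return sku // SKUS_PER_SHARD


def set_stock(sku, quantity):
    """Writes go only to the owning shard, so writes scale out too."""
    shards[shard_for(sku)][sku] = quantity


def get_stock(sku):
    """Reads hit the same single shard; there is no second copy to lag."""
    return shards[shard_for(sku)].get(sku, 0)


set_stock(42_007, 3)
print(shard_for(42_007), get_stock(42_007))
```

Because each SKU lives in exactly one place, both reads and writes for different SKUs spread across the servers, which is the property the replicated model lacked for writes.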

Now what are the pitfalls of sharding?  First, there is some assembly required.  Actually, there is a lot of assembly required.  It’s complicated to build such architectures.  Second, it may be very hard to load balance the shards.  Just dividing up the product inventory across 100 servers is not necessarily helpful.  You would want to use knowledge of access patterns to divide the products so the load on each server is about the same.  If all the popular products wound up on one server, you’d have a scaling disaster.  These balances can change over time and have to be updated, which brings more complexity.  Some say you never stop fiddling with the tuning of a sharded architecture, but at least we don’t have Eventual Consistency.  Hmmm, or do we?  If you can ever get into a situation where there is more than one copy of the data and the one you are accessing is not up to date, Eventual Consistency could rear up as a design choice made by the DB owners.  In that case, they just give you the wrong answer and move on.

How can this happen in the sharded world?  It’s all about that load balancing.  Suppose our load balancer needs to move some data to a different shard.  Suppose the startup just bought 10 more servers and wants to create 10 additional shards.  While that data is in motion, there are still users on the site.  What do we tell them?  Sometimes companies can shut down the service to keep everything consistent while changes are made.  Certainly that is one answer, but it may annoy your users greatly.  Another answer is to tolerate Eventual Consistency while things are in motion with a promise of a return to full consistency when the shards are done rebalancing.  Here is a case where the Eventual Consistency didn’t last all that long, so maybe that’s better than the case where it happens a lot.
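To get a feel for how much data is "in motion" when going from 100 to 110 shards, here is a rough sketch.  It assumes a hypothetical placement rule (SKU number modulo the shard count) that the post does not specify, and a simple read path that prefers the new home but falls back to the old one while the copy is under way.

```python
OLD_SHARDS = 100
NEW_SHARDS = 110
SKUS = range(100_000)


def owner(sku, shard_count):
    """Hypothetical placement rule: SKU number modulo the shard count."""
    return sku % shard_count


# SKUs whose owning shard differs between the old and new layouts.
moving = [sku for sku in SKUS
          if owner(sku, OLD_SHARDS) != owner(sku, NEW_SHARDS)]
print(f"{len(moving):,} of {len(SKUS):,} SKUs change shards")


def read_during_rebalance(sku, old_shards, new_shards):
    """Prefer the new home; fall back to the old one if the copy has not landed."""
    new_home = new_shards[owner(sku, NEW_SHARDS)]
    if sku in new_home:
        return new_home[sku]
    return old_shards[owner(sku, OLD_SHARDS)].get(sku)


old = [{} for _ in range(OLD_SHARDS)]
new = [{} for _ in range(NEW_SHARDS)]
old[owner(12_345, OLD_SHARDS)][12_345] = 7    # not yet copied to its new shard
print(read_during_rebalance(12_345, old, new))
```

Under this modulo rule most SKUs move when the shard count changes, which makes the in-motion window matter.  The fallback read keeps the site up, but if a write lands on one copy while a reader consults the other, the reader sees a stale value: Eventual Consistency for the duration of the move.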

Note that consistency is often in the eye of the beholder.  If we’re talking Internet users, ask yourself how much harm there would be if a page refresh delivered a different result.  In many applications, the user may even expect or welcome a different result.  An email program that suddenly shows mail after a refresh is not at all unexpected.  That the user didn’t know the mail was already on the server at the time of the first refresh doesn’t really hurt them.  There are cases where absolute consistency is very important.  Go back to the sharded database example.  It is normal to expect every single product in the inventory to have a unique id that lets us find that part.  Those ids have to be unique and consistent across all of the shards.  It is crucially important that any id changes are up to date before anything else is done or the system can get really corrupted.  So, we may create a mechanism to generate consistent ids across shards.  This adds still more architectural complexity.
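One simple way to get unique ids without the shards talking to each other is to interleave them: shard i of n hands out i, i + n, i + 2n, and so on.  This is just one possible mechanism, sketched under the assumption that the shard count is fixed; the class name ShardIdGenerator is invented for the example.

```python
import itertools


class ShardIdGenerator:
    """Hands out ids that cannot collide with ids from any other shard.

    Shard i of n produces i, i + n, i + 2n, ... so no coordination is
    needed once each shard knows its index and the total shard count.
    """

    def __init__(self, shard_index, shard_count):
        if not 0 <= shard_index < shard_count:
            raise ValueError("shard_index must be in [0, shard_count)")
        self._ids = itertools.count(shard_index, shard_count)

    def next_id(self):
        return next(self._ids)


generators = [ShardIdGenerator(i, 3) for i in range(3)]
issued = [gen.next_id() for _ in range(4) for gen in generators]
print(issued)
```

The catch is that the scheme bakes in the shard count.  Adding shards later means changing the stride on every server at once, which is exactly the kind of extra architectural complexity described above.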

There are nightmare scenarios where it becomes impossible to shard efficiently.  I will oversimplify to make it easy and not necessarily correct, but I hope you will get the idea.  Suppose you’re dealing with operations that affect many different objects.  The objects are divided into shards naturally when examined individually, but the operations between the objects span many shards.  Perhaps the relationships between shards are incompatible to the extent that there is no way to shard them across machines such that every single operation doesn’t hit many shards instead of a single shard.  Hitting many shards will invalidate the sharding approach.  In times like this, we will again be tempted to opt for Eventual Consistency.  We’ll get to hitting all the shards in our sweet time, and any accesses before that update is finished will just live with inconsistent results.  Such scenarios can arise where there is no obvious good sharding algorithm, or where the relationships between the objects (perhaps it’s some sort of real time collaborative application where people are bouncing around touching objects unpredictably) are changing much too quickly to rebalance the shards.  One really common case of an operation hitting many shards is queries.  You can’t anticipate all queries such that any of them can be processed within a single shard unless you sharply limit the expressiveness of the query tools and languages.
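The query case is easy to illustrate.  In the sketch below the data is sharded by SKU, but the question being asked ("which red items are in stock?") has nothing to do with SKU, so the only option is to ask every shard and merge the answers.  The sample data and the function name query_all_shards are made up for the example.

```python
# Three shards, each owning a disjoint set of SKUs.
shards = [
    {101: {"name": "red mug", "stock": 4},
     102: {"name": "blue mug", "stock": 0}},
    {201: {"name": "red hat", "stock": 9}},
    {301: {"name": "green scarf", "stock": 2},
     302: {"name": "red scarf", "stock": 1}},
]


def query_all_shards(predicate):
    """Run the same filter on every shard and merge the matching SKUs."""
    matches = []
    for shard in shards:                 # one round trip per shard
        for sku, item in shard.items():
            if predicate(item):
                matches.append(sku)
    return sorted(matches)


red_in_stock = query_all_shards(
    lambda item: "red" in item["name"] and item["stock"] > 0
)
print(red_in_stock)
```

The cost of such a query grows with the number of shards rather than staying constant, and unless every shard is frozen while it runs, the merged answer can mix results taken at slightly different moments.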

I hope you come away from this discussion with some new insights:

-  Inconsistency derives from having multiple copies of the data that are not all in sync.

-  We need multiple copies to scale.  This is easiest for reads.  Scaling writes is much harder.

-  We can keep copies consistent at the expense of slowing everything down to wait for consistency.  The savings in relaxing this can be quite large.

-  We can somewhat balance that expense with increasingly complex architecture.  Sharding is more efficient than replication, but gets very complex and can still break down, for example. 

-  It’s still cheaper to allow for Eventual Consistency, and in many applications, the user experience is just as good.

Big web sites realized all this long ago.  That’s why sites like Amazon have systems like SimpleDB and Dynamo that are built from the ground up with Eventual Consistency in mind.  You need to look very carefully at your application to know what’s good or bad, and also understand what the performance envelope is for the Eventual Consistency.  Here are some thoughts from the blogosphere:

Dare Obasanjo

The documentation for the PutAttributes method has the following note

Because Amazon SimpleDB makes multiple copies of your data and uses an eventual consistency update model, an immediate GetAttributes or Query request (read) immediately after a DeleteAttributes or PutAttributes request (write) might not return the updated data.

This may or may not be a problem depending on your application. It may be OK for a del.icio.us style application if it took a few minutes before your tag updates were applied to a bookmark but the same can’t be said for an application like Twitter. What would be useful for developers would be if Amazon gave some more information around the delayed propagation such as average latency during peak and off-peak hours.

Here I think Dare’s example of Twitter suffering from Eventual Consistency is interesting.  In Twitter, we follow micro-blog postings.  What would be the impact of Eventual Consistency?  Of course it depends on the exact nature of the consistency, but let’s look at our replicated reader approach.  Recall that in the Eventual Consistency version, we simply tolerate that we allow reads to come in so fast that some of the replicated read servers are not up to date.  However, they are up to date with respect to a certain point in time, just not necessarily the present.  In other words, I could read at 10:00 am and get results on one server that are up to date through 10:00 am and on another results only up to date through 9:59 am.  For Twitter, depending on which server my session is connected to, my feeds may update a little behind the times.  Is that the end of the world?  For Twitter users, if they are engaged in a real time conversation, it means the person with the delayed feed may write something that looks out of sequence to the person with the up to date feed whenever the two are in a back and forth chat.  OTOH, if Twitter degraded to that mode rather than taking longer and longer to accept input or do updates, wouldn’t that be better?
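The "up to date through a certain point in time" idea can be sketched in a few lines.  The posts, authors, and the function name feed_as_of are invented, and times are written as plain HHMM integers (958 means 9:58 am) to keep the example short.

```python
# (time posted as HHMM, author, text), in posting order
posts = [
    (958, "alice", "anyone around?"),
    (959, "bob", "yes, what's up?"),
    (1000, "alice", "lunch at noon?"),
]


def feed_as_of(applied_through):
    """What a replica shows if it has applied changes up to this time."""
    return [text for posted, _, text in posts if posted <= applied_through]


current_replica = feed_as_of(1000)   # caught up through 10:00 am
lagging_replica = feed_as_of(959)    # only caught up through 9:59 am
print(len(current_replica), len(lagging_replica))
```

The lagging replica is not showing garbage.  It shows a correct feed that is one minute old, which is why a reply written from it can look out of sequence to someone reading the current one.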

Erik Onnen

Onnen wrote a post called “Socializing Eventual Consistency” that has two important points.  First, many developers are not used to talking about Eventual Consistency.  The knee jerk reaction is that it’s bad, not the right thing, or an unnecessary compromise for anyone but a huge player like Amazon.  It’s almost like a macho thing.  Onnen lacked the right examples and vocabulary to engage his peers when it was time to decide about it.  Hopefully all the chatter about Amazon’s SimpleDB and other massively scalable sites will get more familiarity flowing around these concepts.  I hope this article also makes it easier.

His other point is that when push comes to shove, most business users will prefer availability over consistency.  I think that is a key point.  It’s also a big takeaway from the next blog:

Werner Vogels

Amazon’s CTO posted to try to make Eventual Consistency and its trade offs more clear for all.  He lays a lot of good theoretical groundwork that boils down to explaining that there are tradeoffs and you can’t have it all.  This is similar to the message I’ve tried to portray above.  Eventually, you have to keep multiple copies of the data to scale.  Once that happens, it becomes harder and harder to maintain consistency and still scale.  Vogels provides a full taxonomy of concepts (e.g. Monotonic Write Consistency et al) with which to think about all this and evaluate the trade offs.  He also does a good job pointing out how often even conventional RDBMSs wind up dealing with inconsistency.  Some of the best (and least obvious to many) examples include the idea that your mechanism for backups is often not fully consistent.  The right answer for many systems is to require that writes always work, but that reads are only eventually consistent.

Conclusion

I’ve covered a lot of consistency-related tradeoffs involved in database systems for large web architectures.  Rest assured that unless you are pretty unsuccessful, you will have to deal with this stuff.  Get ahead of the curve and understand for your application what the consistency requirements will be.  Do not start out being unnecessarily consistent.  That’s a premature optimization that can bite you in many ways.  Relaxing consistency as much as possible while still delivering a good user experience can lead to radically better scaling as well as making your life simpler.  Eventual Consistency is nothing to be afraid of.  Rather, it’s a key concept and tactic to be aware of.

Personally, I would seriously look into solutions like Amazon’s SimpleDB while I was at it.

Posted in amazon, data center, enterprise software, grid, platforms, soa, software development

To Rule the Clouds Takes Software: Why Amazon SimpleDB is a Huge Next Step

Posted by smoothspan on December 15, 2007

One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them…

J. R. R. Tolkien

There is much interesting cloud-related news in the blogosphere.  Various pundits are sharing a back and forth on the potential for cloud centralization to result in just a very few datacenters and what that might mean.  The really big news is Amazon’s fascinating new addition to their cloud platform of SimpleDB.  Let’s talk about what it all means.

Sun’s CTO, Greg Papadopoulos, has been predicting that the earth’s compute resources will resolve into about “five hyperscale, pan-global broadband computing services giants” — with Sun, in its version of this future scenario, the primary supplier of hardware and operations software to those giants. The last was channeled via Phil Wainewright, who goes on to ask, “What is it about a computing grid that’s inherently “more centralized” in nature?”  He feels that Nick Carr has missed the mark and swallowed Sun’s line hook, line, and sinker.  For his part, Carr’s only crime was to seize on a good story, because at the same dinner, another Sun executive, Subodh Bapat, was telling Carr that sometime soon a major datacenter failure would have “major national effects.”  The irony is positively juicy with Sun talking out both sides of their proverbial mouths.

The tradeoff that Carr and Wainewright are worried about is one of economies of scale that favor centralization versus flexibility and resiliency that favors decentralization.  Where they differ is that Carr sees economies of scale winning in a world where IT matters less and less and Wainewright favors the superior architectural possibilities of decentralization.  Is datacenter centralization inexorable?  In a word, yes, but it may not boil down to just 5 data center owners, and it may take quite a while for the forces at work to finish this evolution.  The factors that determine who the eventual winners will be are also quite interesting, and have the potential to change a lot of landscapes that today are relatively isolated.  Let’s consider what the forces of centralization are.

First, there is a huge migration of software underway to the cloud.  In other words, software that is never installed on your machine or in your company’s datacenter.  It resides in the cloud and comes to you via the browser.  Examples include SaaS on the business side and the vast armada of consumer Web 2.0 products such as Facebook.  No category is safe from this trend, not even traditional bastions as should be clear from the growing crop of Microsoft Office competitors that reside in the cloud.

Second, this migration leads to centralization.  The mere act of building around a cloud architecture, even if it is a private cloud in your own company’s datacenter, leads to centralization.  After all, software is moving off your desktop and into that datacenter.  When many companies are aggregated into a single datacenter, into a SaaS multi-tenant architecture, for example, further centralization occurs.  When you offer a ubiquitous service to the masses, as is the case with something like Google, the requirements to deliver that can lead to some of the largest datacenter operations in the land. 

Third, there are the aforementioned economies of scale.  Google has grown so large that it now builds its own special-purpose switches and servers to enable it to grow more cheaply.  The big web empires are all built on the notion of scaling out rather than scaling up, and they run on commodity hardware.  Because they have so many servers, automating their care and feeding has been baked into their DNA.  Not so with most corporate datacenters that are just beginning to see the fruits of crude generic technologies like virtualization that seek to be all things to all people.  Virtualization is a great next step for them, but there are bigger steps ahead yet that will further reduce costs.

Fourth, the ultimate irony is that centralization begets centralization through network effects.  This is the story of the big consumer web properties.  Every person that joins a social network adds more value to the network than the prior person did.  The value of the network grows exponentially.  This connectedness is facilitated most easily in today’s world by centralization.  Vendors that start to get traction increase their network effects in various ways:  Amazon charges to bring data in and out of their cloud, but not to transfer between services within the cloud.

Lastly, there are green considerations at work.  The biggest costs associated with datacenters these days are around electricity and cooling.  Microsoft is building a data center in Siberia, which is both cold and pretty central to Asia.  Consider this:  given the speed of light over a fiber connection, what is the cost of latency in having a data center somewhere far north (and cold) in Canada like Winnipeg versus far south (and hot) like Austin, Texas?  It’s 1349 miles, which, as the photon travels (186,000 miles per second) is about 7.2 milliseconds.  The world’s fastest hard drive, the nifty Mtron solid state disks I’m now coveting thanks to Engadget and Kevin Burton, can only write a paltry 80K or so bytes in that time:  not even enough for one photo at decent resolution.  So consider a ring of datacenter clusters built in colder regions.  Centralized computing is up north where the cold that computers like is nearly free for the asking: just open a window many days.  Or come closer.  Put it up on a mountain peak.  Immerse it near a hydro dam and get the juice cheaper too.  It doesn’t matter.  Laying fiber is pretty cheap compared to paying the energy bills.
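The latency figure is a one-line calculation, sketched below.  The distance and the speed of light are the numbers from the paragraph; the fiber factor is an added assumption (light in glass travels at roughly two-thirds of its vacuum speed), so treat the second number as a rough estimate rather than a measurement.

```python
MILES = 1349                      # Winnipeg to Austin, from the post
LIGHT_MILES_PER_SEC = 186_000     # speed of light in vacuum, rounded
FIBER_FACTOR = 2 / 3              # assumed slowdown of light in glass fiber

one_way_ms = MILES / LIGHT_MILES_PER_SEC * 1000
fiber_one_way_ms = one_way_ms / FIBER_FACTOR
print(f"vacuum: {one_way_ms:.2f} ms, fiber: {fiber_one_way_ms:.2f} ms one way")
```

Either way the one-way delay is on the order of ten milliseconds, ignoring routing and switching overhead, which is small next to the energy savings being argued for.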

The next question is trickier: how do these clouds compete?  Eventually, they will become commoditized, and they will compete on price, but we are a long ways from that point.  At least 10 years or more.  Before that can happen, customers have to agree on what the essential feature sets are for this “product”.  I believe this is where software comes into play, and that should be a matter of great concern for the hosting providers of today whose expertise largely does not revolve around software as a way to add value.   As Eric Schmidt said (via Nick Carr) when he started saying Google would enter this market:

For clouds to reach their potential, they should be nearly as easy to program and navigate as the Web. This, say analysts, should open up growing markets for cloud search and software tools—a natural business for Google and its competitors.

Some will immediately react with, “Hold it a minute, what about the hardware?  What about the network?”  The best of the cloud architectures will commoditize those considerations away.  In fact, commoditization will start down at the bottom of the technology stack and work its way up.  The first stage of that, BTW, is already almost over.  That was the choice of CPU.  MIPS?  PowerPC?  SPARC?  No, Intel/AMD are the winners.  The others still exist (not all of them!), but they’ve peaked and are on their way down at various terminal velocities.  Their owners need to milk them for profit, but it would be a losing battle to invest there.  Even Macs now carry Intel inside, and Sun now carries the ticker symbol “JAVA”, a not-so-subtle hat tip to the importance of software.

Hardware boxes are largely a dead issue too.  There is too little opportunity to differentiate for very long and the cpu’s dictate an awful lot of what must be done.  Dell is an assembler and marketer of the lowest cost components delivered just in time lest they devalue in inventory.  Sun still pushes package design, and it may have some relevance to centralization, but this will be commoditized because of centralization.

Next up will be the operating system.  Again, we’re pretty far down the path of Linux.  Corporations still carry a lot of other things inside their firewalls, but the clouds will be populated almost exclusively with Linux, and we could already see that has happened if we could get reliable statistics on it.  Linux defines the base minimum of what a cloud offering has to provide:  utility computing instances running Linux.  This is exactly what Amazon’s EC2 offers.

What else does the cloud need?  Reliable archival storage.  Again, Amazon offers this with S3.  Cloud consumers are adopting it in droves because it makes sense.  It’s a better deal than a raw disk array because it adds value versus that disk array for archival storage.  The value is in the form of resiliency and backup.  Put the data on S3 and forget about those problems.  This begins the commoditization of storage.  Is it any wonder that EMC bought VMware and that a software offering is now most of their market cap?  Hardware guys, put on your thinking caps, this will get much worse.  What software assets do you bring to the table?

3Tera is a service I’ve talked about before that has a very similar offering available from multiple hosting partners of theirs.  They create a virtual SAN that you can backup and mirror at the click of a mouse.  They let you configure Linux instances to your heart’s content.  Others will follow.  IBM’s Blue Cloud offers much the same.  This collection is today’s blueprint for what the Cloud offers in terms of a platform.

But, this platform is a moving target, and it will keep moving up the stack.  Amazon just announced another rung up with SimpleDB.  For most software that goes into the Cloud, once you have an OS and a file system, the next thing you want to see is a database.  Certainly when I attended Amazon Startup Project, the availability of a robust database solution was the number one thing folks wanted to see Amazon bring out.  The GM of EC2 promised me that this was on the way and that there would be several announcements before the end of the year.  First we saw the availability of EC2 instances that had more memory, disk, and cpu, so that they’d make better database hosts.  SimpleDB is much more ambitious.  It’s a replacement for the conventional database as embodied in products like MySQL and Oracle that was designed from the ground up to live in a cloud computing world.  At one stroke it solves a lot of very interesting problems that used to challenge would-be EC2 users around the database.

Along the lines of my list of factors that drive data center centralization, Phil Windley says the economics are impossible to stop.  Scoble asks whether MySQL, Oracle, and SQL Server are dead:

Since Amazon brought out its S3 storage service, I’ve seen many many startups give up data centers altogether.

Tell me why the same thing won’t happen here.

There is no doubt in my mind that all startups will give up having datacenters altogether before this ends.  However, before we get too carried away assuming that SimpleDB gives us that opportunity, let’s drop back and consider what its limitations are:

- It is similar to a relational database, but there are significant differences.  Code will have to be reworked to run there, even if it doesn’t run afoul of the other issues.

- Latency is a problem when your database is in another datacenter from the rest of your code.  Don MacAskill brings this one up, and all I can say is that this is another network effect that leads to more centralization.  If you like SimpleDB, it’s another reason to bring all of your code inside Amazon’s cloud.

- All fields are strings, and they are limited to 1024 characters.  Savvy developers can use the 1024 characters to point to files of unlimited size on S3, or use other methods like combining fields, to get around this limit.  Mind you, a lot can be done with that, but it is again a difference from traditional RDBMSs and it means more work for developers that must overcome the limitation.

- There are no joins.  If you want them (and many proponents of hugely scalable sites view joins as evil), you have to roll your own.

- Transactions and consistency are also absent.  Reads are not guaranteed to be fully up to date with writes.

- There is no indexing and a whole host of other trappings that database aficionados have gotten comfortable with.
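To give a flavor of the extra work the strings-only, 1024-character limit implies, here is a minimal sketch of two common workarounds: zero-padding numbers so they compare correctly as strings, and parking oversized values elsewhere while keeping a short pointer.  Nothing here calls the real SimpleDB or S3 APIs.  The dictionary blob_store stands in for S3, and the pad width and the "ref:" prefix are arbitrary choices made for the example.

```python
MAX_ATTR_CHARS = 1024   # per-value limit described in the list above
PAD_WIDTH = 10          # assumed width; must cover the largest number stored

blob_store = {}         # stand-in for S3: key -> large value


def encode_int(n):
    """Zero-pad so string order matches numeric order (non-negative only)."""
    if n < 0 or len(str(n)) > PAD_WIDTH:
        raise ValueError("number out of range for this encoding")
    return str(n).zfill(PAD_WIDTH)


def encode_text(item_key, text):
    """Keep short text inline; park long text in the blob store behind a pointer."""
    if len(text) <= MAX_ATTR_CHARS:
        return text
    blob_key = f"blob/{item_key}"
    blob_store[blob_key] = text
    # Caveat: inline text that itself starts with "ref:" would be ambiguous.
    return f"ref:{blob_key}"


def decode_text(value):
    """Follow the pointer if there is one, otherwise return the inline text."""
    if value.startswith("ref:"):
        return blob_store[value[len("ref:"):]]
    return value


prices = sorted(encode_int(n) for n in [250, 9, 1200])
long_review = "x" * 5000
stored = encode_text("sku-42-review", long_review)
print(prices, stored)
```

Without the padding, sorting the raw strings "250", "9", and "1200" would put "1200" first.  This is the kind of detail a typed relational column handles for you and that you now own.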

Mind you, serious web software is created within these limitations including some at Amazon itself.  In exchange for living with them, you get massively scalable database access at good performance and very cheaply.  And, as TechCrunch says, you may be able to get rid of one of the highest cost IT operations jobs around, database administration, and your costs are even lower.  Remember my analysis that shows SaaS vendors need to achieve 16:1 operations cost advantages over conventional software and you can see this is a big step in that direction already.

There is no doubt that cloud computing will be massively disruptive, and that Amazon are well on their way in the race to plant their flag at the top of the mountain.  The pace of progress for Amazon Web Services has been blistering this year, and much more hype free than what we’ve gotten from the likes of Google and Facebook when it comes to platform speak.  It’s almost odd that we haven’t heard more from these other players, and especially from the likes of Google.  GigaOm says that SimpleDB completes the Amazon Web Services Trifecta.  They go on to say that Amazon’s announcements have the feel of a well thought out long term strategy, while Google’s make it sound like an ad hoc grab bag of tools.  I think that’s true, and perhaps reflective of Google’s culture, which is hugely decentralized to the point of giving developers 20% free time to work on projects of their choosing.  The problem is that such a culture can more easily give us a grab bag of applications, as Google has, than it can provide a well-designed platform, as Amazon has.  Or, as Mathew Ingram puts it, while everyone else was talking about it, Amazon went ahead and did it.

I’ve talked to a dozen or so startups that are eagerly working with the Amazon Web Services and having great success, as well as some frustrations.  They require rethinking the old ways.  Integrity issues are particularly different in this brave new world, as are issues of latency.  That matters to how a lot of folks think about their applications.  Because of the learning curve, I don’t plan to go out and short Oracle immediately, but the sand has started running in the hourglass.  There will be more layers added to the cloud, and over time it will become harder and harder to ignore.  There will be economic advantage to those who embrace the new ways, and penalties for those who don’t.  This is a bet-your-business drama that’s unfolding, make no mistake.  At the very least, you need to get yourself educated about what these kinds of services offer and what they mean for application architecture.

Businesses located low in the stack I’ve mentioned will be hit hard if they don’t have a strategy to embrace and win a piece of the cloud computing New Deal.  We’re talking hardware manufacturers like Sun, Dell, IBM, and HP.  Software infrastructure comes next.  Applications that depend on low cost delivery, aka SaaS, are also very much in the crosshairs, although probably at a slightly later date.

Welcome to the brave new world of utility cloud computing.  Long live the server, the server is dead!

Related Articles

Amazon Raises the Cloud Platform Bar Again With DevPay

Coté’s Excellent Description of the Microsoft Web Rift :  Nice post on cloud computing at Microsoft

Posted in Web 2.0, amazon, data center, ec2, grid, platforms, saas | 10 Comments »

A Kindle User After My Own Heart

Posted by smoothspan on November 29, 2007

Go read Josh Taylor’s post on how he took a Kindle to the Caribbean and why he has fallen into “deep like” for the device after that.  Being able to travel without a suitcase full of books was the first lightbulb that lit for me when I heard about Kindle.  The truth is, I’d seen an eBook a long long time before Kindle.  I can’t even remember whose it was, but we’re talking before Blackberry even existed.  It was a lame device back then, but I would still have bought one but for lack of decent book selection.  Despite O’Reilly not being there yet, I think Amazon has the means to fix the selection problem, and the device is certainly light years ahead of most of what we’ve seen even if many are still unconvinced.

Tidbits from Taylor’s post:

  • Taylor loved the Kindle’s screen for reading text, but says graphics, even black and white pictures, are almost hopeless.  I still haven’t personally seen a Kindle, but my friend Song Huang was recently telling me how impressive the eInk display is.  He saw one at a conference somewhere and was convinced they had just stuck a piece of paper behind glass as a mockup.  When the thing updated and showed it wasn’t paper, he was blown away.  I’d love to hear whether line art looks good on a Kindle.  That’s the sort of thing I’d want if reading a technical book, although it’s a shame actual pictures are so poor–it’ll make it hard to see screen shots.
  • As to the UI, Taylor loves the navigation but laments you can’t put Kindle away without accidentally flipping a page.  Has no one ever been reading their paperback, nodding off, dropped the book and lost their place entirely?  Must be my age if I’m the only one.  He also had an incident where his wife went to the beach without a proper charge and the Kindle died.  Doh!  Hate when that happens!

I also liked learning that Amazon will let you grab the first chapter of any book free to see if you like it before purchasing.  As I wrote in my original Kindle post, there are lots of ways the buying experience can be enhanced by Kindle.  One of my minor book purchasing peccadilloes is an inability to keep track of all the authors and which of their novels I already have.  Every now and then I wind up with two copies of something.  Nothing worse than diving into what you think is a new offering from a favorite author only to discover you’ve already read the book!  I want to be able to get into a book club for my faves whereby I get notified as soon as something new is available and I can get the book with one click.  BTW, Amazon is famous for patenting the one click (I believe they recently lost that patent too).  I would expect them to try to patent a lot of the new stuff behind Kindle.  Patents are not my favorite thing, but they are a fact of life.

Scoble ran an interview on the street with a woman who wanted to see his Kindle while he was giving a talk at Stanford.  I came away from the interview with a slightly different reaction than I think Scoble and others may have.  There is a view that Kindle’s foibles are disastrous, but I’m not at all convinced.  Scoble points out that this woman hit many of his complaints almost immediately:

Notice that she accidentally hits the “next” button. That she tries to use it as a touch screen. That she is bugged by the refresh rate. But, she, like me, is interested enough to want to buy one (she’s the first that I’ve shown it to that has that reaction). Imagine if Amazon had designed it better? Imagine how many more people would want it.

The thing is, if you watch the video, none of that bothered her.  She made an assumption that is common outside Silicon Valley: if the thing didn’t work as she expected it to, it was not a problem, it just meant she needed to learn.  Sometimes I think we get too focused on a particular view of how things have to work in the Valley, and we’re way over the top critical when they don’t.  Many successful products are riddled with inconsistencies, but work so well compared to the alternatives that we ignore them.  I’m typing this in WordPress and let me tell you, it has at least as many UI foibles as Kindle, but it doesn’t matter, and it’s wildly successful.

I do agree with Scoble that if Kindle had been as perfect as iPhone or iPod from the get go, if it had been just as sexy, and just as “right”, Kindle would be a much bigger success.  However, let’s reflect on two thoughts.  First, Josh Taylor remarks that the Kindle must be popular because you can’t get one.  Note that this may not be the whole story.  Amazon may be limiting supply for a variety of reasons.  They want to understand usage patterns better to see if they can make money, or they want to respond to user criticisms without having a ton of inventory, or even they want to make sure it doesn’t damage their lucrative Christmas season.  Second, iPhone and iPod were not first generation devices in their categories.  I suppose we can argue that Kindle isn’t either, but it seems to me the precursors of the Apple products were much closer to success than Kindle’s precursors are.

All this has, um, kindled my desire to have a Kindle.  Still not sure I’ll put it on the Christmas list (you can’t seem to get one anyway), but my birthday is early in the year.  I just hope to see the rumored Apple Tablet device before I have to pull the trigger.  Wouldn’t it be awesome if Amazon takes the Open Road and has an OEM offering for other eBook builders?  Wouldn’t it be even more awesome if the Apple Tablet picked up the backend of the Kindle service and accessed it from their own UI?  Whoa!  Stranger things have happened, but not often…

Posted in Web 2.0, amazon, platforms, user interface, wireless | 3 Comments »

One Cloud, Two Clouds, Four Clouds, More?

Posted by smoothspan on November 26, 2007

GigaOm writes recently that the world may only need 5 clouds, echoing a misquote attributed to Thomas J. Watson at IBM.  Nick Carr is much closer to the likely outcome when he writes about Vertical Clouds, an interesting article well worth a read.

The reality is that we have not yet settled on exactly what product the clouds are offering.  Today, we’re at the lowest common denominator of Linux dial tone along with bulk storage.  That’s Amazon EC2 and S3 in a nutshell.  It may be that after a suitable period of consolidation the world only needs 5 or so Linux dial tone offerings.  Given the number of nearly identical services offering web and email servers, we seem to be a long long ways from that consolidation.

I like Carr’s idea better.  Linux dial tone is useful, but ultimately doesn’t take utility computing very far along the path to its true potential.  That potential extends well beyond broadly generic services and into vertically oriented spaces just as non-cloud computing products do.  Just take a look at the plethora of database offerings alone.  There are Open Source databases like MySQL, column store and other specialized DB’s for Business Intelligence, Enterprise mainstays like Oracle and DB2, database hardware like Teradata, and so on.  Each of these is filling a particular niche and ecosystem.  Each of those niches can spawn at least one specialized cloud to service the interests of the niche.  The clouds are a long ways from being mature enough to start taking on multiple niches at once.

The linkages between clouds will also be interesting.  Someone remarked to me that the problem with the cloud is there isn’t just one, there are many, and there are walls between them.  We’re starting to see those walls break down in some cases.  Look at the offer by Joyent to do hosting of Facebook apps.  The offer is free to the first 3500 developers to sign up.  How do they do it?  By means of a special cloud-to-cloud link:

There is also no latency. We have set up a direct physical fiber optic line between the Joyent data center and Facebook’s data center. Somewhere under San Francisco bay, there is a multiple-gigabit-per-second fiber line capable of pumping massive traffic.

Fiber is remarkably cheap.  I would expect to see more partnerships between non-competing cloud vendors who provide connections between their clouds that offer advantages in terms of bandwidth, cost for bandwidth, and latency just like this example.  Imagine you’re creating an enterprise application of some kind.  Perhaps it’s even CRM and you want to host a component on Force.  But, Force is expensive, and you don’t want everything there.  Perhaps you’ll also need a connection to WebEx so you can do teledemos with customers.  Lastly, you want some kind of Business Intelligence capability that goes well beyond what Force offers.  Now let’s suppose you discover a utility computing vendor that has special cloud connections out to Force and WebEx, and offers BI capabilities as part of their service.  That would be exciting to you, and probably to a lot of other vendors.

Here’s another one that matters: geography.  Which geographies does your cloud vendor cover and how does that map back to your business?  Amazon recently announced the ability to target S3 data to their European datacenter.  Much more will follow.  Connections to the CDN’s will also factor in here.  There are a lot of other scenarios that could develop.  Suppose Facebook decides hosting is a way to monetize.  If you want to tie into their Social Graph database, you have to build your application in their cloud.  Hmmm.  That’s a head scratcher.  What could companies like Google do by persuading you to move into their cloud?  What if there was an economic rationale to save both parties money by colocating?  Perhaps it becomes cheaper for Google to search your content and part of that savings is passed on to make it cheaper for you to host your content inside Google’s cloud.

Get ready for a lot of complexity and choice to be injected into the cloud computing picture.  The connections that take place between different categories of Enterprise software are pretty well understood.  What’s less well understood is how they will manifest themselves as connections between clouds.  It seems clear that there are many opportunities for success out there.  Many more than just five clouds will be needed.  In fact, Business Development people should take note: before too long, a piece of the puzzle will involve asking which clouds are directly connected to which other clouds.

Cloud computing is about to get much more interesting!

Posted in Web 2.0, amazon, data center, saas, strategy | No Comments »

Will MS Office Or Oracle Be Slaughtered First By The Cloud?

Posted by smoothspan on November 26, 2007

There’s a spate of articles about the new Live Documents service, a cloud-based SaaS challenger to Microsoft Office.  Like any good entrepreneur, CEO and founder Sabeer Bhatia claims his new baby and its relatives will displace MS Office by 2010.  That’s right around the corner, but wildly optimistic.  ZDNet’s Dan Farber calls these cloud Office wannabes “ankle biters”, and with good reason.  They don’t come close to posing a threat to Microsoft yet.  It’s not for lack of trying.  There’s a pretty good crowd of them out there now.  Live Documents has joined a cast that includes Google, Zimbra (now owned by Yahoo), Zoho, ThinkFree, Adobe, and others.

Why can’t these guys take over by 2010?  Because every story you read is largely a man bites dog story.  There’s little to no discussion of amazing new features these products offer that would give a reason to switch.  The mere fact that products calling themselves Office equivalents can exist in the cloud without needing to be installed seems to be newsworthy enough.  There are no great roundup reviews that are getting big attention in the blogosphere that play them off against each other.  What one does find are articles telling us about the introduction of features that seem painfully elementary.  It wasn’t so long ago that the Google Spreadsheet learned how to hide columns, for example.  Even as TechCrunch writes “While Live Documents Yaps, Zoho Delivers,” Stowe Boyd writes that he can’t get Zoho to play nicely with Google Gears, even though the ability to work disconnected is the big newly announced feature for Zoho.  Apparently the Live Docs messaging annoys Michael Arrington, who writes:

New product press releases unencumbered by the complexities of releasing actual software set off alarm bells. And when those press releases are so boastful as to suggest that the (unlaunched) product can hurt a competitor’s $20 billion revenue stream, the alarm bells get much louder…

So far Live Documents is nothing more than bullshit and smokescreens. That may have been the way to do business when Bhatia co-founded Hotmail in 1996, but his software is going to have to survive on its own in a hyper competitive marketplace when it actually launches. Hubris alone won’t do it.

From my perspective, this matter-of-fact attitude of “let’s paste together a bunch of things out on the web and not worry too much if they work well” is a problem for an Office Killer.  Microsoft Office represents basic literacy in the business world.  Give it up and you may find you’re unable to speak the lingua franca of others you must communicate with day to day.  Real challengers have to keep this in mind.  I was a General in the last Office Suite Wars, having fathered Quattro Pro at Borland.  We made considerable inroads (Quattro Pro sold on the order of $100 million its first year) but ultimately fell by the wayside because we lacked a word processor to go with our suite.  The one thing I can tell you is that absolute 100% compatibility with the market leader at the time together with significant innovation over that market leader and significant economic advantage (we were much cheaper at the outset) were key to the success we did achieve.  I don’t have a sense these upstarts have achieved any of these ideals.

There is a market that I think is more interesting when we look at who might become a Cloud Casualty sooner rather than later.  I’m speaking of Oracle, of course, and specifically of the database server business.  MySQL appears poised for an IPO, but beyond that there is a raft of contenders out there who have achieved a lot relative to Oracle.  There are fairly Oracle compatible products like Enterprise DB.  There are products that have serious innovations such as the column-based DB’s.  And the economic advantages are undeniable.  Unlike the Office Suite arena, it gets harder and harder to find significant killer features that Oracle offers that don’t exist in the Open Source DB world.  We’ve seen extremely large web sites powered by MySQL and some of these others, for example.  It would be hard to claim Oracle is dramatically more scalable in the face of the evidence, although one can likely conclude it remains easier to scale and more performant. 

The problem is that the costs associated with Oracle licenses are positively usurious.  A friend who runs a SaaS company says his Oracle costs are bigger than his hardware costs.  I asked him why he continues to use it and he indicated his CTO was convinced he had to for scalability.  At my last company, Callidus, we ran some tests and were surprised at how performant these solutions were compared to Oracle.  I believe that currently, it’s an inertia thing.  People are sure they can get Oracle to scale, they have people on board who know how, and unless cost is a serious issue from the get go (as it can be with startups focused on ad revenues), the tried and true is chosen.  When enough people have experience scaling MySQL and its relatives, that inertia will have gone away.  Venture Capitalists tell me most of their portfolio companies are there.  As companies built on technologies like MySQL mature and people move on to other jobs, the word spreads.  Businesses running old-school on-premises software will be the last bastions for that inertia, but realistically, I’ve still talked to members from that elite group who are keeping COBOL CICS and IBM AS400 software alive.

It won’t take an awful lot of shift before Oracle feels the pain.  The problem is that Oracle depends on this business as a cash cow to finance its other expansion.  Knock 20-25% off the top through erosion to these upstarts and it will dramatically change the Oracle profit picture for the worse.  Here’s another interesting strategic point to consider: utility computing ties together larger tectonic plates that can result in greater and more sudden market shifts.  Imagine companies like Amazon deciding to offer database dial-tone in their rent-a-clouds.  The database is the most labor intensive and problematic piece of software in the whole suite.  If someone automates those problems away, promises scalability on a utility computing grid, and handles normal SQL, many will rush at the opportunity.  Such aggregation of users can drive a lot of license fees away in a hurry.  They also take away a lot of the inertia issue in a couple of ways.  First, the cloud service has to deal with the operational knowledge required for server care and feeding.  Customers won’t have to.  Second, these cloud vendors are no small shakes.  Names like Amazon, Google, and Microsoft are bandied about.  The fear of dealing with some Open Source vendor that is viewed as too small and flaky is greatly ameliorated.  An additional sweetener is that the competitive strength of users of such services would be much greater versus their competitors who still use expensive Oracle and have to manage the servers themselves.

Despite interest by folks like Nick Carr in Live Documents, breaking Oracle’s stranglehold on the database world is a more likely spot for the disruption of the old school to show up first.  It would mark quite a change.

Posted in amazon, data center, enterprise software, grid, platforms, strategy | 3 Comments »

Amazon Web Services Continues to Mature, Google to Follow Soon?

Posted by smoothspan on November 7, 2007

Amazon recently announced the availability of S3 in Europe, something that customers have been clamoring for.  This is both to reduce latency for US-based companies and to make it easier for European companies to embrace the service.  Presumably Amazon themselves have datacenters in all sorts of nifty places, and an East Asian center would be a worthwhile next step as they continue to roll out the service.  This new capability works with a “local constraint” that identifies where the S3 storage bucket is to be located.  The default remains US datacenters.

When I attended Amazon Startup Project, there was a lot of interest among developers present for some ability to gain a little control over exactly where the machine resources they were purchasing from Amazon materialized.  This manifested at a couple of levels.  First was the ability to access multiple data centers for redundancy.  The “local constraint” option could be expanded to include East and West Coast US locations, for example.  The second request was an ability to specify that machines could be in the same datacenter, but that they ought to be on separate racks, again to increase resilience in the event of a failure that impacts the whole rack.  Note that S3 already has a lot of this kind of redundancy built in, and it is more EC2 (the ability to buy raw Linux machines) that we’re dealing with here.  I can imagine Amazon will get to all of this in the fullness of time.  The requests are not unreasonable nor should they be all that difficult to implement.

Meanwhile Red Hat has announced SaaS pricing for their Red Hat Linux when offered on EC2, an interesting development.  It sounds like a good thing, but I’m still trying to decide whether their pricing makes sense.  It’s $19/month per user plus 21 to 94 cents per compute hour.  In exchange you get Tech Support and access to all the RHEL (Red Hat Enterprise Linux) apps.  My problem is they want to charge $19/month per user plus an additional hourly charge that’s as much as Amazon wants for the EC2 hardware itself (at least the “small” configuration).  That sounds like a lot, particularly the “per user” piece.  Perhaps they meant to say “per server”, but that isn’t how the release is worded.  We’ll have to wait and see if there is clarification later.  The bottom line on this is that systems software companies are starting to take notice and view Amazon as another platform to support.

Speaking of systems software, we’re still waiting to see a persistent database solution.  When I attended Amazon Startup Project, they mentioned some sort of persistent database support would be available  by end of this year, and at least one of the entrepreneurs who spoke said they were beta testing the solution.  This is a gaping hole in their offering, and one I’m sure they’ll be filling as soon as they can.  What remains to be seen is whether they offer a solution developed by Amazon, or whether a partner steps up.  For example, mySQL could offer something along the lines of what Red Hat is doing.  In looking at the Red Hat pricing, and thinking about some of the things I’ve heard about mySQL (1 in 10,000 “customers” actually pays for mySQL), I wonder if this sort of thing doesn’t provide an opportunity for these vendors to deal themselves a new hand in terms of how they deal with customers.  It wouldn’t be the first time a transition to a SaaS model radically changed the rules.  We’ll have to see how it all works out.

Meanwhile, Amazon keeps ticking along with a pretty good pace of announcements around the service.  Recall we’ve gotten bigger servers, and SLA’s, two things that were much in demand around the time I went to Startup Project.  On the SLA front, Amazon is coming through with flying colors.  Read/Write Web tells us they’re hitting four 9’s, which is an extra “9” on what their SLA’s promise.  That’s a solid number that even big companies struggle to achieve.

There are two areas of challenge that I’ll be interested to watch as events continue to unfold.  First, there remains a suspicion that Amazon Web Services are largely a remaindering service and that they aren’t even trying to make money with it, but rather recover costs on low server utilization for the rest of their business.  If this is true, then at some point service levels will degrade as the excess capacity is used up and Amazon fails to invest in keeping ahead of that curve.  While it’s possible that this is the case, I’m skeptical.  I think their current pricing actually does let them make money.  There is certainly a fair amount of premium being charged relative to other hosting services.  They must have relentlessly driven service costs down and invested in nearly total automation of the infrastructure.  If the service is profitable or nearly profitable, then we can count on them to keep investing in it as it grows.

This brings me to my second thought.  So far, Amazon has largely focused on delivering capabilities they had to build for their core business anyway.  At some point, the average customer’s needs will deviate from Amazon’s view of how web architectures should be built.  They’ve said, for example, that there are no plans to offer their keyed virtual storage system Dynamo as a service.  There may be a lot of reasons for this.  Ironically, it may not be multitenant, so they may fear opening it up is too risky for the overall business: not enough isolation between tenants.  An alternative view may be that for all but the very largest of web sites, a service like Dynamo is just not necessary.  Most sites want mySQL or the equivalent.  Whether we choose to view that choice as enlightened or not is immaterial.  Many very large web sites get built around vanilla relational technology and they wind up working fine without anything exotic like Dynamo. 

My question on all of that is how much further will Amazon go?  If they’re just milking technologies they’ve built that have broad applicability, they will have a decision to make.  That decision is how heavily to invest in technologies to be delivered via Web Services that have no benefit for the rest of Amazon’s business.  My guess is they’ll go slowly on such investments, preferring to see partners develop the technologies.  We’ll see an interesting first test when we see what happens around persistent database support.

All in all, the service continues to have a bright future.  People who want to directly equate the raw cost of servers to the cost of on-demand utility computing are not making an apples to apples comparison in my mind.  They miss a lot of the benefits that services like Amazon’s offer that are just not available in a raw server or even a virtual appliance setting.  Businesses have to decide how much they value those services, but early indications for S3, Amazon’s highest value-add service, are very positive.  If you don’t see the additional value added, go with a different alternative.  There are many options in today’s market, with more all the time.

Speaking of more all the time, I’ve been hearing rumblings that Google may announce their equivalent shortly.  I’m all ears for that one!  I will be interested to see if it’s more Google vaporware (i.e. you won’t really be able to use it until end of 2008) or if it’s something that’s ready to go immediately.

Related Articles

Update on Red Hat:  the add-on pricing is per server, not per user, as I had speculated.

Posted in Web 2.0, amazon, platforms, saas, strategy | 1 Comment »

Amazon Beefs Up EC2 With New Options

Posted by smoothspan on October 16, 2007

I’ve been a big fan of Amazon’s Web Services for quite a while and attended their Startup Project, which is an afternoon seeing what it can do and hearing from entrepreneurs who’ve built on this utility computing fabric.  Read my writeup on the Startup Project for more.  Amazon has been steadily rolling out improvements, such as the addition of SLA’s for the S3 storage service.  Today, there is big news in the Amazon EC2 camp:

Amazon has just announced two new instance types for their EC2 utility computing service.  The original type will continue to be available as the “small” type.  The “large” type has four times the CPU, RAM, and Disk Storage, while the “extra large” has eight times the CPU, RAM, and Disk.  The large and extra large also sport 64 bit cpus.  Supersize your EC2!

Why do this?  Because the original small instance was a tad lightweight for database activity with just 1.7GB of RAM while the extra large at 15GB is about right.  Imagine a cluster of the extra large instances running memcached and you can see how this is going to dramatically improve the possibilities for hosting large sites.

One of the neat things about this new announcement is pricing.  They’ve basically linearly scaled pricing.  Whereas a small instance costs 10 cents per instance hour, the extra large has 8x the capacity and costs 8×10 cents or 80 cents per hour.
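
The arithmetic is simple enough to sketch in a few lines of Python.  The 10 cent small-instance rate and the 4x and 8x capacity multiples come from the announcement as described above; the 40 cent figure for the large instance is only what linear scaling implies, and the function names and the 720-hour month are my own assumptions for illustration.

```python
SMALL_RATE_CENTS = 10  # cents per instance hour for the small type

# Capacity relative to the small instance, per the announcement described above.
CAPACITY_MULTIPLE = {"small": 1, "large": 4, "extra large": 8}


def hourly_cost_cents(instance_type):
    """Price per instance hour, assuming price scales linearly with capacity."""
    return SMALL_RATE_CENTS * CAPACITY_MULTIPLE[instance_type]


def monthly_cost_dollars(instance_type, hours=720):
    """Cost of running one instance for `hours` hours (720 is a 30-day month)."""
    return hourly_cost_cents(instance_type) * hours / 100


xl_hourly = hourly_cost_cents("extra large")
small_monthly = monthly_cost_dollars("small")
```

Under these assumptions one extra large instance costs the same per hour as eight small ones, so the choice comes down to whether your workload wants one big box (a database with lots of RAM) or many small ones.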

What’s next?  These new instances open a lot of possibilities, but Amazon still doesn’t have painless persistence for databases like mySQL.  If you are running mySQL on an extra large instance and the server goes down for whatever reason, all the data on it is lost and you have to rebuild a new machine around some form of hot backup or failover.  That exercise has been left to the user.  It’s doable: you have to solve the problem in any data center of what you plan to do if the disk totally crashes and no data can be recovered.  However, folks have been vocally requesting a better solution from Amazon where the data doesn’t go away and the machine can be rebooted intact.  I was told by the EC2 folks at the Startup Project to expect 3 announcements before the end of the year that were related.  I’m guessing this is the first such announcement and two more will follow. 

There’s tremendous excitement right now around these kinds of offerings.  They virtualize the data center to reduce the cost and complexity of setting up the infrastructure to do web software.  They allow you to flex capacity up or down and pay as you go.  Amazon is not the only such option.  I’ll be reporting on some others shortly.  It’s hard to see how it makes sense to build your own data center without the aid of one of these services any more. 

Posted in Web 2.0, amazon, ec2, grid, multicore, platforms, saas, software development | 2 Comments »

Amazon Launches SLA’s for Web Services

Posted by smoothspan on October 8, 2007

Amazon has just announced an SLA policy for S3 that’s retroactive to October 1.  Arthur Bergman on O’Reilly and others have been grumbling about Amazon’s lack of Service Level Agreement, although that doesn’t seem to have impacted the enthusiasm of a lot of startups using the services that I talked to.  Nevertheless, it is evidence that they’re quite serious about encouraging others to use their Platform as a Service.

Some competitors, like Flexiscale, had been touting Amazon’s lack of SLA as a big advantage for their own services.  Scratch one advantage!

The way the SLA works is that if Amazon doesn’t meet their commitment to 99.9% up time, you’re entitled to get back up to 25% of the fee for the month, depending on the details of what actually happened.  That’s a pretty normal SLA, and it guarantees that Amazon can’t make a profit on the service (I doubt their margins are much higher!) unless they can keep to their SLA’s, so they’re properly motivated.
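
To get a feel for what 99.9% means in practice, here is a small Python sketch.  The 99.9% commitment and the 25% ceiling on credits are from the announcement as described above; the function names, the 30-day month, and the example fee are assumptions of mine, and the actual credit in any given month depends on details the sketch does not model.

```python
def allowed_downtime_minutes(uptime_percent, days=30):
    """Minutes of downtime per billing period that still meet the commitment."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


def max_credit(monthly_fee, max_credit_fraction=0.25):
    """Upper bound on the refund: at most 25% of the month's fee."""
    return monthly_fee * max_credit_fraction


budget = allowed_downtime_minutes(99.9)   # roughly 43 minutes in a 30-day month
ceiling = max_credit(100.0)               # hypothetical $100 monthly bill
```

In other words, Amazon can be down for about three quarters of an hour a month before any credit is owed at all.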

Related Articles

SmugMug is happy, they use Amazon S3, although they haven’t been worried about the lack of SLA’s in the past.

Posted in Web 2.0, amazon, business, platforms, saas, strategy | 6 Comments »

To Escape the Multicore Crisis, Go Out Not Up

Posted by smoothspan on September 29, 2007

Of course, you should never go up in a burning building, go out instead.  Amazon’s Werner Vogels sees the Multicore Crisis in much the same way:

Only focusing on 50X just gives you faster Elephants, not the revolutionary new breeds of animals that can serve us better.

Vogels is writing there about Michael Stonebraker’s claims that he can demonstrate a database architecture that outperforms conventional databases by a factor of 50X.  Stonebraker is no one to take lightly: he’s accomplished a lot of innovation in his career so far and he isn’t nearly done.  He advocates replacing the Oracle (and mySQL) style databases (which he calls legacy databases) with a collection of special purpose databases that are optimized for particular tasks such as OLTP or data warehousing.  It’s not unlike the concept others and I have talked about that suggests that the one-language-fits-all paradigm is all wrong and you’d do better to adopt polyglot programming.

I like Stonebraker’s work.  While I want the ability to scale out to any level, as Vogels suggests, I will take the 50X improvement as a basic building block and then scale that out if I can.  That’s a significant scaling factor even when viewed in terms of the Multicore Language Timetable.  It’s nearly 8 years of Moore’s Cycles.  I’m also mindful that databases are the doorway to the I/O side of the equation, which is often a lot harder to scale out.  Backing an engine that’s 50X faster at sucking the bits off the disk with memcached ought to lead to some pretty amazing performance.
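For the curious, the conversion from a speedup factor to years of Moore’s Cycles is a one-liner.  This is a rough sketch that assumes performance doubles every 18 months; that doubling period is my assumption here, and a different period moves the answer accordingly.

```python
import math


def years_of_doubling(speedup, months_per_doubling=18):
    """Years of steady doubling needed to reach `speedup`.

    `months_per_doubling` is an assumed Moore's Law period, not a measured one.
    """
    doublings = math.log2(speedup)  # 50X is between 5 and 6 doublings
    return doublings * months_per_doubling / 12


if __name__ == "__main__":
    print(round(years_of_doubling(50), 1), "years at 18 months per doubling")
    print(round(years_of_doubling(50, months_per_doubling=24), 1), "years at 24 months")
```

With an 18-month period the result lands in the same ballpark as the figure above; with a 24-month period it is closer to 11 years.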

But Vogels is right: in the long term we need to see different beasts than the elephants.  It was with that thought in mind that I’ve been reading with interest articles about Sequoia, an open source database clustering technology that makes a collection of database servers look like one more powerful server.  It can be used to increase performance and reliability.  It’s worth noting that Sequoia can be installed for any Java app using JDBC without modifying the app.  Their clever moniker for their technology is RAIDb:  Redundant Array of Inexpensive Databases.  There are different levels of RAIDb, just as there are RAID levels, that allow for partitioning, mirroring, and replication.  The choice of level or combination of levels governs whether your application gets more performance, more reliability, or both.

Sequoia is not a panacea, but for some types of benchmarks such as TPC-W, it shows a nearly linear speedup as more CPUs are added.  It seems likely that a combination of approaches, such as Stonebraker’s specialized databases for particular niches and clustering approaches like Sequoia, all running on a utility computing fabric such as Amazon’s EC2, will finally break the multicore logjam for databases.

Posted in Open Source, amazon, ec2, grid, multicore, platforms, software development | 3 Comments »

Guido is Right to Leave the GIL in Python, Not for Multicore but for Utility Computing

Posted by smoothspan on September 14, 2007

There’s been a lot of back and forth in the Python community over something called the “GIL” or Global Interpreter Lock.  Probably the best “get rid of the GIL” argument comes from Juergen Brendel’s post.  Guido, the benevolent dictator of Python, has responded in his own blog that the GIL is here to stay, and that he doesn’t think it is a problem, nor that trying to remove it is even the right choice.  Both combatants have been eloquent in expressing their views.  As is often the case, they’re optimizing to different design centers and likely will have to agree to disagree.

Now let’s try to pick apart this issue in a way that everyone can understand, and see what it means for large-scale scalability in the world of SaaS and Web 2.0.  Note that my arguments may be invalid if your scaling regime is much smaller, but as we’ve seen for sites like Twitter, big time scaling is hard and has to be thought about carefully.

First, a quick explanation of the GIL.  The GIL is a single lock inside the Python interpreter that a thread must hold before it can run Python code and touch Python objects.  Only one thread may do so at a time, so the other threads have to wait their turn.

Whoa!  That sounds like Python has no ability to scale for multiple cores at all!  How can that be a good thing?  You can see where all the heat is coming from in this discussion.  The GIL just sounds bad, and one blogger refers to it jokingly as the GIL of Doom.

Yet all is not lost.  One can access multiple CPUs using processes, and the processes run in parallel.  Experienced parallel programmers will know that the difference between a process and a thread is that a process has its own state, while threads share their state with other threads.  Hence a thread can reach out and touch the other thread’s objects.  Python is making sure that when that touch happens, only one thread can touch at a time.  Processes don’t have this problem because their communication is carefully controlled and every process has its own objects.
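Here is a minimal sketch of that difference using only Python’s standard library.  The function names and the little counter are hypothetical, made up for illustration.  The threads all update one shared object (note that I still take an explicit lock of my own, since the GIL by itself does not make an update like `+=` atomic), while the child process is a separate interpreter that cannot see the parent’s objects and only reports back through a pipe.

```python
import subprocess
import sys
import threading


def bump(counter, lock, times):
    """Increment a counter object that every thread in this process can reach."""
    for _ in range(times):
        with lock:  # explicit lock: the read-modify-write below is not atomic
            counter["value"] += 1


def run_threads(n_threads=4, times=1000):
    """Threads share state: all of them update the same dictionary."""
    counter = {"value": 0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=bump, args=(counter, lock, times))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]


def run_child_process():
    """Processes have their own state: the child works on its own objects
    and communicates with the parent only through its stdout pipe."""
    parent_counter = {"value": 0}
    child_code = "counter = {'value': 0}; counter['value'] += 1; print(counter['value'])"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True, text=True, check=True,
    )
    # The child's answer arrives as text; the parent's object is untouched.
    return int(result.stdout.strip()), parent_counter["value"]


if __name__ == "__main__":
    print("threads, shared counter:", run_threads())
    print("child value, parent value:", run_child_process())
```

The process version is clumsier for a toy like this, but nothing in it cares whether the child runs on the same machine or not, which matters for the argument that follows.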

Why do programmers care about threads versus processes?  In theory, threads are lighter weight and they can perform better than a process.  We used to argue back and forth at Oracle about whether to use threads or processes, and there were a lot of trade-offs, but it often made sense to go for threads.

So why won’t Guido get rid of the GIL?  Well, for one thing, it was tried and it didn’t help.  A new interpreter was written with fine-grained locking that minimized the times when multiple threads were locked out.  For most applications it ran at half the speed of the GIL version (or worse on Linux).  The reason is that the extra lock calls were expensive:  lock is a slow operating system function.  The way Guido put this was that on a 2 processor machine, Python would run only slightly faster than on a single processor machine, and he saw that as too much overhead.  Now I’ve commented before that we need to waste more hardware in the interest of higher parallelism, and this factor of 2 goes away as soon as you run on a quad-core CPU, so why not nix the GIL?  BTW, those demanding the demise of the GIL seem to feel that since Java runs faster and supports threads, the attempt at removing the GIL must have been flawed and there is a better way.
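The arithmetic behind that “factor of 2” can be sketched with a deliberately crude model.  It assumes the fine-grained interpreter pays a flat 2X locking penalty and then scales perfectly with the number of cores; both are simplifying assumptions of mine, and real programs will do worse.

```python
def relative_speed(cores, lock_overhead=2.0):
    """Throughput of the fine-grained interpreter relative to the GIL version
    running on one core, under an idealized perfect-scaling assumption."""
    return cores / lock_overhead


if __name__ == "__main__":
    for cores in (1, 2, 4, 8):
        print(cores, "cores:", relative_speed(cores), "x the single-core GIL speed")
```

Under those assumptions, two cores merely break even with the GIL version, and it takes four cores before you are ahead by 2X.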

I find myself in a funny quandary on this one, but ultimately I agree with Guido.  There is little doubt that the GIL creates a scalability speed bump, but that speed bump is localized at the low end of the scalability space.  If you want even more scalability, you still have to do as Guido recommends and use processes, with sockets or some such to communicate between them.  I also note that a lot of authorities feel that threads are much harder to program than processes, and they call for shared-nothing access.  Highly parallel languages like Erlang are focused on a process model for that reason, not a thread model.

Let me explain what all that means.  Threads run inside the same virtual machine, and hence run on the same physical machine.  Processes can run on the same physical machine or on another physical machine.  If you architect your application around threads, you’ve done nothing to access multiple machines.  So, you can scale to as many cores as there are on the single machine (which will be quite a few over time), but to really reach web scales, you’ll need to solve the multiple machine problem anyway.

As Donald Knuth says, “premature optimization is the root of all evil (or at least most of it) in programming.”  Threads are a premature optimization when you need massive scaling, while processes lead to greater scalability.  If you’re planning to use a utility computing fabric, such as Amazon EC2, you’ll want processes.  In this case, I’m with Guido, because I think utility computing is more important in the big picture than optimizing for the cores on a single chip.  Take a look at my blog post on Amazon Startup Project to see just a few things folks are doing with this particular utility computing fabric.

Posted in Web 2.0, amazon, data center, ec2, grid, multicore, platforms, saas, software development | No Comments »