SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

Archive for the 'data center' Category


Hurry, The Cloud Computing Platform Opportunity is Perishable!

Posted by smoothspan on April 7, 2008

As I write this post, many are predicting that the big announcement from Google tonight will be that it’s opening up BigTable for the world to use.  At least Kevin Burton and Mike Arrington think so.  I hope so, because the world needs a lot more cloud computing choices.  I wonder how many have figured out just how little time remains to introduce new cloud computing platforms?

Ray Ozzie has said, “[the cloud market] really isn’t being taken seriously right now by anybody except Amazon.”  He’s right on the mark:  it isn’t being taken seriously by anyone except Amazon.  The distant runner-up is Benioff’s Force.com.  I say distant because there are a lot of problems with it, not the least of which is an economic model that makes it completely untenable for anyone but big corporate IT to use.  Technically, it is a completely closed and proprietary environment that offers only minimal leverage.  It’s true, they’re very serious about it, so in that sense we should add them to the list, but the way they’re going about it makes it seem less than serious.

Here’s an important tip for various big industry players who’ve made noise about Cloud Computing at various points:  it’s a perishable opportunity!  You don’t have forever to contemplate how to get in and start winning.

Why?

Because ultimately it boils down to differentiation and commoditization like any market.  The longer you wait, the more bipolar the market becomes.  Allow Amazon to get too strong and you’ll have two choices:

-  Copy Amazon’s API’s very closely and charge a lot less. 

-  Launch a radically different approach that offers big advantages in some other way.

The middle ground will be untenable.  An API or service that is only slightly better than Amazon’s but is incompatible won’t succeed.  We’ve seen this time and time again in our industry.  It’ll play out the same way here.  For a brief time everyone can be slightly different.  Then the world will discover the differences don’t matter and they’ll gravitate towards one player.  If someone already has huge momentum (e.g. Amazon), you must either be incredibly differentiated or much cheaper.  Both are pretty hard to do.

We could ask whether Amazon has already reached a stage that only the two options can fly.  I don’t think so.  Not quite anyway.  It takes longer than you’d think, although their success has been phenomenal.  My prediction is that the window to introduce a major new cloud computing platform initiative is not quite 2 years.  If you’re not out by end of 2009, you will face a major uphill struggle.  In fact, if you’re not a great big player, the window is much less.

There are significant challenges for the big players to execute quickly enough:

-  Sun never seems to execute on anything quickly enough.  Sorry guys, but the company just doesn’t evolve very fast.  That’s why you’re buying properties like MySQL, right?

-  Google wants to be a precision machine, focused on squeezing margin out of a lucrative model.  What would they do, if like Amazon, they announced this thing and suddenly had more traffic to it than their core properties?  They have a history of absorbing startups and then taking a long time to get the thing to a level they feel is commensurate with their standards.  Cloud computing is in many ways worse.  They lose control and let other people’s software run inside their firewalls on their servers.

-  Microsoft is in the unenviable position the old RISC world was in against Intel.  They have to build everything themselves on their platforms.  There is no synergy with third parties.  It’s ironic really.  The Intel/Microsoft PC Keiretsu could divide and conquer, and they were so successful that even Apple finally went Intel because the others couldn’t afford to do it themselves.

-  Yahoo?  People used to talk about them in the same breath, but clearly the wheels are coming off that stagecoach.  For a big player, cloud computing is not a little investment.  Particularly now when there is quite a lot of momentum already built.  Yahoo’s bets are laid, and they’re a lame duck besides.  Count them out.

-  IBM?  Could be.  They’ve made announcements but the follow up is weak.  IBM could certainly afford to throw enough services at the problem to get it going until the technology catches up.  They can sure sell such a thing.  The biggest challenge they have is their command and control culture may never let it reach critical mass.

-  Tata et al:  Big Indian or Chinese.  Why not?  These are huge companies overseas.  They have the expertise to do quite a lot.  The Asian markets are hot, hot, hot, and they’re not that well served by Amazon.  These guys would be my bet for the odds on Dark Horse players if they get it and can get their act together.  They’re ideal as low cost providers and like IBM, they can throw service at it until they get it right.  There is surprisingly little technology required at this stage to get started at the level Amazon is at.  You need an EC2 and an S3 clone and a bit of window dressing that does something they don’t.  How about an identity system?  I’ve written about that before.  Wouldn’t you think if a service was announced business would fly to it overseas?

Meanwhile, Amazon is coming to a sort of crossroads as well.  The traffic to Amazon Web Services exceeds the traffic to the rest of their properties combined.  This is no longer a remaindering strategy for unused MIPS, as many VC’s I talked to late last year seemed to feel.  Amazon is now experiencing significant growth and scaling pains for the service.  EC2 just went down for about an hour for many customers.

This is both good news and bad news for Amazon.  The good news is that they’re learning how to keep these systems up and the others haven’t even started up that learning curve.  The bad news is it annoys customers mightily.

The other thing I watch Amazon for is signs they’ll offer anything with AWS that they didn’t already have to build for their core business.  The availability of something interesting and new would be a further signal that this is not just a remaindering business.  More importantly, it would be a further barrier to entry and exit around their valuable property.  As it stands, EC2, S3, and SimpleDB are pretty low level.  They do not represent big barriers.  All that is available in one form or another via Open Source to others who want to play.  Amazon’s expertise in billing and payment processing is more differentiated, but not compelling and as currently offered, very Amazon-centric.

Note to Werner Vogels:  it’s time to look for key innovations in AWS to build lock-in while you continue to make the service more robust.

Note to others:  Time is running out.  Get in the game or move on.

Note to self:  Look for a dip and buy AMZN stock.

Related Articles

Google responded well to the challenges I set forth above with App Engine.  See my blog post for more details.  By focusing on language support instead of raw virtual machines, they’ve actually raised the bar in the sort of way I keep saying Amazon needs to above and below in the comments.  I stick to my 2 year prognosis.  If you aren’t a Big Player here within 2 years, the window will close.  What Google has done is raise the ante on what you must deliver to be in the poker game.

Posted in Web 2.0, amazon, data center, ec2, platforms, saas | 12 Comments »

Amazon Ran Out of Capacity

Posted by smoothspan on February 18, 2008

As I suggested in my original post on the topic, Amazon’s recent S3 outage was due to running out of capacity.  Specifically, they ran out of authentication capacity.  In part, this problem was due to the fact that Amazon wasn’t monitoring exactly this part of their capacity envelope very well.  High Contrast has the Amazon quote telling us that it was also due to just a few customers radically increasing their load on the system in an unpredictable way:

the surge was caused by at least one very large customer plus several other customers suddenly and unexpectedly increasing their usage. 

So far, most of the pundits are in something of a denial mode.  They argue that nothing really new and interesting is happening here.  All services go down, including the electric company.  Vinnie Mirchandani says corporate data centers have been going down a lot more often than 99.999% uptime allows for since forever.  Folks like Nick Carr seem to feel the biggest issue in this outage was that users didn’t have timely information and Amazon is fixing that.
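To put that 99.999% figure in perspective, here is a quick back-of-the-envelope sketch of how much downtime a given availability target actually permits in a year.  The function name and the list of targets are mine, chosen purely for illustration; nothing here comes from Amazon or from the writers quoted above.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability: float) -> float:
    """Return the minutes of downtime per year permitted by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical targets, from "three nines" to "five nines".
    for label, target in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
        print(f"{label}: {allowed_downtime_minutes(target):.2f} minutes/year")
```

Five nines works out to a little over five minutes a year, so any outage measured in hours blows through many years of that budget at once.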

This all misses a bigger point.  What these writers are doing is attempting to apply the old standards and methods against the new world of Cloud Computing.  The trouble is, there is something genuinely new at work here that goes beyond the inevitability of some outages and the need to be more transparent with customers about what is going on.  The problem Amazon and other would-be cloud platform purveyors face is predictability.  The world they deal in is radically less predictable than corporate data centers of old because the Internet today has much lower friction and higher connectivity between different web sites that make load spikes increasingly sudden and intense.  There is a cascade of dominoes effect that is enabled by the low friction web that wasn’t nearly so twitchy in the past. 

The premise of any large computing infrastructure is that by sharing the load across many customers (and in Amazon’s case, sharing excess capacity from their core retail business), we enable headroom for such load spikes.  But how realistic is that concept?

Consider this Alexa plot of CNN and Flickr traffic over time:

[Chart: Alexa traffic over time for CNN and Flickr]

Do these two curves look predictable to you?  Take CNN, for example.  To handle the big spikes requires 2-3x overload capacity.  Flickr is a little less crazy except for one massive event that involved a doubling in a very short time.  This latter event was permanent in its effect, so if you were counting on temporarily borrowing some headroom, you would have had to keep it in place indefinitely and grow from there.  Ironically, that chart was brought to my attention at the Amazon Startup Project, where they used it to sell the idea that Amazon Web Services gives a startup the unlimited headroom it couldn’t afford to purchase on its own.

These charts are displaying non-linear behaviour, the hardest of all phenomena to predict.  This non-linearity is becoming more and more common because the Internet has become extremely viral.  It is crosslinked, the very meaning of the word “web”, and messages travel along the links with almost no friction.  Viral has become a virtue, and much of the current innovation is focused around how to make the viral spread of information more likely.  Social Networks are all about such behaviour.  Take a look again at those CNN spikes.  Now let’s imagine your cloud computing infrastructure is hosting a bunch of different blogging, micro-blogging, video, photo sharing, and other social sites.  The CNN spikes no doubt represent something newsworthy happening.  The greatest likelihood is that each spike will be echoed at some level across all of these sites that are in the business of spreading information.  Friction has been lowered to the point it is almost non-existent when it comes to the spread of memes on the Internet.  We have major spikes from world events, such as the assassination of a world leader.  In the Internet, we can have major spikes from such inane moments as Scoble shedding tears of delight over new Microsoft secret software.  And the whole thing is wired together.  That one tear on Scoble’s cheek breeds a thousand or more accounts ranging from poking fun to trying to guess what this secret software is.  There is a ravenous beast poised over the keyboard waiting for something interesting to pass onto its network of other ravenous beasts.

This is decidedly non-linear behaviour and impossible to predict.  The answer is that major cloud computing infrastructure providers will need to have considerable excess capacity available on tap at all times to avoid outages.  Take Amazon.  Web bandwidth to their web services now exceeds the total traffic to all of their other properties.  What might have once been a nice remaindering business allowing them to resell their excess capacity is now driving the need for more capacity.  They have just a few choices.  They can invest in a lot more hardware and lower the margins on their business, or they can implement some strategies to limit the availability of the service to some customers.  It strains credulity to think they’ll limit capacity to their retail business.  How will they decide?  Tiered pricing of some kind?
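Here is a rough sketch of why pooling customers only buys headroom when their spikes are unrelated.  The traffic numbers are invented for illustration (they are not Alexa data), and "overload factor" is simply my shorthand for peak load divided by typical load.

```python
from statistics import median


def overload_factor(samples: list[float]) -> float:
    """Peak load divided by typical (median) load for one traffic series."""
    if not samples:
        raise ValueError("need at least one sample")
    typical = median(samples)
    if typical <= 0:
        raise ValueError("typical load must be positive")
    return max(samples) / typical


def pooled_overload_factor(tenants: list[list[float]]) -> float:
    """Overload factor of the combined load when tenants share one pool."""
    combined = [sum(step) for step in zip(*tenants)]
    return overload_factor(combined)


# Hypothetical hourly request counts, made up for this example.
news_site = [100, 110, 95, 105, 300, 120, 100, 98]   # one news-driven spike
photo_site = [50, 52, 49, 51, 150, 55, 50, 48]       # echoes the same event
quiet_site = [80, 82, 79, 81, 80, 83, 78, 80]        # unaffected by the news

if __name__ == "__main__":
    print("news alone:        ", round(overload_factor(news_site), 2))
    print("news + quiet pool: ", round(pooled_overload_factor([news_site, quiet_site]), 2))
    print("news + photo pool: ", round(pooled_overload_factor([news_site, photo_site]), 2))
```

Pooling the news site with an unaffected tenant lowers the overload factor, but pooling it with a site that echoes the same event leaves it essentially unchanged.  That is the low-friction problem in miniature: correlated spikes defeat the sharing argument.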

Think in terms of other unexpected networked events.  I’m reminded of financial markets and the law of unintended consequences.  Look at today’s housing market.  Remember Long-Term Capital Management, a hedge fund with Nobel Laureates who had mathematical proofs they would continue making money.  Right up until they unpredictably collapsed.  BTW, this sort of thing used to happen with the electrical grid too.  In both cases, the financial markets and the electrical grid, elaborate means were put into place to artificially inject friction to damp the machine’s oscillations before it could destroy itself.  There are elaborate rules in the stock exchanges about shorting stocks that are falling.  They inject a form of friction back into those markets to prevent total free fall.

Perhaps this points the way to new technology for Cloud Computing infrastructure.  A gentle injection of the right kind of friction at the right point for a limited time might prevent suddenly massive spikes and outages.  It’s an area ripe for innovation.  Meanwhile, Amazon could sorely use some competition.  If a customer could contract for emergency capacity from elsewhere, or even better, if the Cloud Computing Providers could share slack capacity as the electrical companies do, it would be tremendously helpful when the inevitable load spikes arrive.
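One existing technique that behaves like this kind of friction is the token bucket rate limiter.  To be clear, this is my own illustration of what "a gentle injection of friction" could look like, not something Amazon has described, and the class name and parameters are hypothetical.  The bucket lets a short burst through at full speed, then holds admissions to a sustained rate until the spike passes.

```python
class TokenBucket:
    """Admit requests at a sustained rate while allowing short bursts."""

    def __init__(self, rate_per_sec: float, burst: float) -> None:
        self.rate = rate_per_sec   # tokens added back per second
        self.capacity = burst      # maximum tokens that can accumulate
        self.tokens = burst        # start with a full bucket
        self.last = 0.0            # time of the previous update, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a request arriving at time `now` is admitted."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=2.0, burst=5.0)
    # Ten requests arrive in the same instant: only the burst allowance gets in.
    spike = [bucket.allow(now=0.0) for _ in range(10)]
    print("admitted during spike:", spike.count(True))
    # One second later, the bucket has refilled by `rate_per_sec` tokens.
    print("admitted one second later:", [bucket.allow(now=1.0) for _ in range(3)])
```

The caller passes in the clock explicitly so the behaviour is easy to reason about; a real service would use a monotonic clock and decide what to do with rejected requests (queue, delay, or shed them).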

Posted in Web 2.0, amazon, data center, platforms, saas | 3 Comments »

Google Reports iPhone Usage 50x Other Handsets; Amazon S3 Goes Down: Low Friction Has a Cost

Posted by smoothspan on February 15, 2008

As I write this post there are two articles that caught my eye.  For most, the iPhone and Amazon’s Web Services have little to do with one another, but I see a bit of a pattern here that’s interesting.

Slash Lane of Apple Insider reports that Google was shocked that it was seeing 50 times more search requests coming from Apple iPhones than any other mobile handset, a revelation so astonishing that the company originally suspected it had made an error culling its own data.  It’s an amazing statistic, really.  But I can attest to hitting Google quite a lot myself whenever I’m out and about and killing time before the next meeting.  In fact, I am very pleased to have my bookmarks out on a web page rather than in my browser so I can easily access all of my favorite sites from whatever device is at hand.  The iPhone is quite a credible web browser.  I can’t wait for the 3G version and higher speeds.

Following closely on my read of the iPhone piece is Nick Carr’s article about an Amazon S3 outage.  Nothing all that earth-shattering or unexpected, just that S3 was out for several hours this morning, beginning at 7:30am EST.  The gist of the article is that while the outage was to be expected, Amazon did a poor job keeping users informed of what was going on and providing explanations after the fact.  Carr is right, of course, but business is always embarrassed when things go wrong and the first (and wrong) human instinct is to be shy about details.

Why do these two go together?  I’ll give you a hint:  the tales of Facebook applications reaching millions of users in an incredibly short time also go with the theme I’m thinking of.  That theme has to do with friction.  Friction is my word for all the factors that slow adoption.  The time needed for word of mouth, decision making, purchase, installation, getting through the learning curve, and finally being a first class citizen of whatever community results is governed by the degree of friction.

One of the things the Internet does is reduce friction.  In its most extreme, friction actually reverses and becomes a propelling force.  We call that viral marketing.  Most of the innovations in this second Internet round (post-bubble) have been focused on reducing friction.  Social Networks, for example, dramatically reduce the friction of networking.  Twitter dramatically reduces the friction of blogging, right down to limiting the article length to 140 characters so you don’t have to labor over the wordsmithing.

While it’s harder, the web is also a powerful means of reducing friction for more physical things.  The iPhone and Amazon Web Services are two great examples.  In an extremely short time the iPhone has racked up 50x the usage of other competing handsets for the Internet.  The traffic to AWS in approximately the same short time now exceeds the combined traffic for all other Amazon properties.

While the web itself helped to spread the word, I think it is no coincidence that these two have a lot to do with the web and offer a lot of value back to the web.  It’s what some folks call a virtuous circle.  Look for more of these as time goes on.

Now for the cost side.  These growth rates are not predictable.  Nobody would have guessed that either business would get so big so fast.  In fact, many guessed just the opposite.  Even if you did guess it could happen, it would only be a guess that it could, not that it would.  A prudent business would not invest in infrastructure built to the level and assumption that it would happen.  That means there will be painful outages from time to time.  Hopefully, the infrastructure owners will take those outages as signs that it’s time to double down and extend their projections of what might happen much further up and to the right.  Those that succeed in keeping hold of the Tiger by the Tail will survive and prosper.

Posted in Marketing, Web 2.0, amazon, data center, grid, multicore | 5 Comments »

When Do The SaaS Acquisition Games Begin? (A Primer on Cloud Computing Market Segments)

Posted by smoothspan on February 12, 2008

The Yahoo/Microsoft business has turned to utter farce.  Michael Arrington’s line left me in stitches:

Wait. Yahoo and AOL? I Was Looking Forward To Something More…Fierce.

Mathew Ingram calls it “desperation squared.”  We have now moved from the factual to the sublime: a sure signal to Yahoo that they need to get on with being acquired.  When most of the world is laughing at you, and you are a huge company, it means you’ve lost it.  You’re way past the point of no return.  But this is not why we’re here, for the Giants are thinking of dipping into another branch of the Cloud Computing Tree.

Tom Foremski says that Oracle recently approached Salesforce.com to gauge their interest in a possible $75/share offer.  Duncan Riley at Techcrunch finds the rumor plausible, as do I.  I won’t spend a lot more time on this particular scenario.  It will be a question of Oracle’s resolve to buy versus Salesforce’s resolve to remain independent.  But I will say this.  Oracle typically spends 7-8x maintenance revenue to buy companies.  If the rumor is true, they’re offering 13x trailing twelve months total revenue for Salesforce.  It just goes to show the awesome financial power of a good SaaS business.  It’s likely worth that much.  After all, if Oracle is ever going to get started on the road to SaaS (yes, I know, they have a SaaS business already, yada, yada, but not really), starting from a seed as close to $1B a year as possible would help accelerate things.  That’s a real problem, BTW: there just aren’t all that many SaaS properties out there yet for acquirers to choose from.  The space isn’t very far along, and is still very young.

And yet there are machinations going on as various players try to position themselves for the coming battles.  Some of these maneuvers are visible, some are just off the edge where the light is pretty dim.  It’s important to segment the Cloud Computing and SaaS market to gain a better understanding of the terrain.  We’ll leave aside the Web 2.0 world of Facebook et al, though the infrastructure at the bottom of the market segmentation model I present is the same for the Consumer/Web 2.0 world.  Markets tend to consolidate from the bottom of the technology stack up.  The reason is that the bottom layers have been around a lot longer, there are more big players, and momentum there has often slowed.  These are sure signs that a consolidation is in order.  It’s important to know where you are in the stack because it equates to where you are in the M&A food chain.  Consequently, VC’s often try to evaluate how near the bottom an idea is versus how late in the day it’s getting.  Being too low in the stack when the market is very mature is usually a bad thing.  Being high up early is oddly almost never a bad thing.  The very top of the stack is apps, and it takes apps to propel the other layers forward.

All things considered, if you have a killer idea for an app, that’s where you should place your bets.  That would be another reason for Oracle to pay a premium for Salesforce.  The other thing to keep in mind is that the line of safety keeps moving upward.  The snapshot I’ll portray today has that line hovering at the Value Added Hoster level.  It won’t be long before it moves up a notch to encompass the Virtualizers.

The Battle for SaaS Hosting and Platform Dominance

At the very bottom of the SaaS stack are the hosters and platform builders.  There are several armies on the battlefield jockeying already.  There are roughly three market segments:

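[Diagram: SaaS hosting segments, from the Red Zone (Cage and Pipe hosters) through the Yellow Zone (Value Added Hosters) to the Green Zone (Virtualizers)]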

First are the old-school hosters that basically offer raw machines and Internet connectivity: “A Cage and a Pipe.”  These guys are very long in the tooth for the current Cloud Computing era.  The trouble is they are experts on the physical plant but don’t add much value otherwise, and their expertise is now heavily commoditized.  If they don’t learn to offer more value soon, their days are numbered, hence they’re in the “red” zone.

Next up are the value added hosters.  Start with a Cage and a Pipe and add Some Service.  Perhaps that’s as simple as providing system administrators and DBA’s.  Service can become more elaborate.  This group is currently a very popular choice for SaaS startups I talk to.  Very few of these companies are considering the Red Zone.  But the Value Added Hosters need to move upstream as fast as they can, lest they start to go red too.  The services they offer are not hard for the Cage and Pipe crowd to bring on.  There is so far minimal proprietary technology adding value.  Aside from the problem that others can add services, it creates a secondary problem that the cost to deliver the service is higher.  We’ve talked before about how much more efficient SaaS players have to be than conventional users of Enterprise Software.  The Yellow Zone is borderline in that respect.

It shouldn’t be surprising, therefore, when we read things like OpSource’s acquisition of billing company LeCayla.  It gives them technology and a new service to inch them closer to the Green Zone.

This brings us to the Green Zone, which I have dubbed “The Virtualizers.”  Virtualization is their chief technology differentiator, although there is often a whole lot more.  These players want to bring on as many generic components as they can to complete a full Platform as a Service offering.  This is the most interesting and vigorous space, and I predict it represents the future.  If the Red and Yellow Zones can’t find a way to get there, they’ll find themselves increasingly commoditized and marginalized, making their segments very tough businesses indeed.  The Green Zone brings a number of essential advantages, although every player doesn’t offer every advantage. 

One of the big advantages is true On-demand computing.  With Amazon and many others you can buy servers by the hour as needed to deal with load spikes of various kinds.  This leads to a tremendous savings for most organizations, and makes it possible for startups to pay the big bucks only if they’re successful and have the big bucks.  It’s a radical reduction in friction, and that almost always leads to radical growth.  So it is here.  Amazon recently reported more web traffic going to Amazon Web Services than the rest of Amazon’s properties combined.

Companies like 3Tera (check out my 3Tera interview posts) and Q-Layer offer such virtualized data centers in the form of software.  Buy their software and you can create a virtual datacenter.  Or you can buy the hosting as well from these companies and their partners.  They’re very important players because they represent the means by which the Red and Yellow Zones can become Green.

Sun deserves special mention after their purchase of MySQL.  If I were being completely objective, Sun is still very much in the Yellow Zone.  I’m giving Sun and Jonathan Schwartz the benefit of the doubt in terms of where they’re going.  They do offer Sun Grid, and they certainly have the wherewithal.  Whether the organization can really pull together and get it done remains to be seen, but MySQL is a very promising new jewel in that crown.

SaaS Tools

The level above the platform consists of Tools.  First thing to note about this category is that “Tool” is a dirty word among the VC’s and other money mongering intelligentsia.  The story goes that nobody ever got rich on tools, the world now expects tools to be given away, yada, yada.  BTW, I disagree with that sentiment.  There have been lots of very successful tools companies.  I think the real issue is that it’s hard for the Money Men to evaluate tools.  Everyone promises to be able to turn a noobie programmer into a powerhouse of productivity that can single handedly reproduce SAP’s entire suite over the weekend.  Unless you are extremely technical and immersed continuously in the world of Tools, it’s very hard to separate the hype from the reality and the religion from the irrelevant.  Nevertheless, this is a real category, and there’s actually a lot going on here. 

I break this market into three segments:

[Diagram: SaaS tool segments, from Systems Software at the bottom through Languages to Enterprise Tools]

At the bottom, just above the Virtualizers from the prior diagram, we have Systems Software, which I’m classifying here as Databases and App Servers.  Normally we would include operating systems, but they’re spread all around and largely play in the Virtualizer category.  In other words, what’s interesting about Operating Systems vis a vis SaaS and Cloud Computing is virtualization features.  This area is dangerously close to the Platforms where most of the Giants are.  Sun has already set a big foot down here with MySQL.  Amazon is trying to change the game entirely with SimpleDB.  There are some players, such as Elastra, that are trying to skate between Amazon and the rest of the world by offering MySQL on Amazon.  My take is that such plays need to get big really fast or diversify into other services because the window here has to be closing.  There is already so much traffic on Amazon, and so many folks using MySQL there, that it seems likely a single solution will emerge and Amazon is in a good position to dictate what that will be.  I can hear Amazon on the phone call now:

Really, you don’t want to sell to us?  Well, we’re going to deliver your product on AWS in about 6 months and it will be the preferred solution for the platform.

Or that call could be to a MySQL competitor.  There are several, and some say products like PostgreSQL are better for various reasons such as scalability.  What would it mean to Sun if Amazon acquired one and built it into their fabric?  What does it mean to others lower in the stack if all the good DB’s get bought and incorporated into the fabric of Giants?  Definite strategic maneuvering possibilities here.

Next up are the Languages.  Since the dawn of computing, there have been Language Wars.  A lot of this is about separating the religion from the irrelevant, BTW.  Nevertheless, we have the new school of scripting languages circling the castle of traditional curly braced languages like Java and C++ (not that the new guys are bereft of curly braces!).  Their battering rams are pummeling the iron doors of performance ceaselessly with the promise of productivity.  Chief among these are PHP, Python, and Ruby on Rails.  There are successes and failures to point to for all of them.  PHP is largely what powers Yahoo and many older web properties.  Python, while Open Source, seems to be the one championed by Google.  After all, they got Guido.  Ruby on Rails is one that I find interesting, because it doesn’t yet have a big power partner.  It’s Open Source, but without the partner, it remains something of a Free Spirit.  Perhaps that makes it an ideal nucleus for an upstart wanting to take on the Cloud Computing Giants.  Heroku would be one such possibility.  I’ve seen a demo, and it surely did seem pretty cool.  The Ruby brand is still strong, and could propel the right offering far.  Zend is working hard to have a go at PHP as well.  BTW, I would put Force squarely in the language category.  Yes, it is all of the layers below too, but there is a rich set of functionality that adds language and framework, not to mention you must use their proprietary language.

I can’t move on from Languages without mentioning Salesforce’s Force either.  They view it as a Platform-as-a-Service, but it offers so much more than something like Amazon (so far at least) that it deserves a spot higher in the stack.  Force includes a language that is Java-like, but proprietary to Salesforce.  Most developers these days have a problem with proprietary.  They prefer Open Source.  But that’s not even the real Achilles Heel.  Force is currently way too expensive to be practical for ISV’s.  As I’ve discussed many times, your Cost of Service needs to be as far below 50% as you can get.  With Force starting out at $50 per seat per month, customers building on it must charge $100-200 per seat per month to achieve reasonable margins.  That’s largely not possible for ISV’s, so Force is mostly an IT phenomenon.  That makes it less strategic, but perhaps a better cash cow for Salesforce.
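The arithmetic behind that $100-200 range is simple enough to sketch.  This assumes, for illustration only, that the platform fee is the entire Cost of Service; a real ISV would have support, operations, and other delivery costs on top, which would push the required price higher still.  The function names are mine.

```python
def minimum_price(platform_cost: float, cost_of_service_ratio: float) -> float:
    """Price per seat per month needed so platform cost is the given share of revenue."""
    if platform_cost < 0:
        raise ValueError("platform_cost must be non-negative")
    if not 0.0 < cost_of_service_ratio <= 1.0:
        raise ValueError("cost_of_service_ratio must be in (0, 1]")
    return platform_cost / cost_of_service_ratio


def gross_margin(price: float, platform_cost: float) -> float:
    """Fraction of each dollar of revenue left after paying the platform fee."""
    if price <= 0:
        raise ValueError("price must be positive")
    return 1.0 - platform_cost / price


if __name__ == "__main__":
    force_fee = 50.0  # dollars per seat per month, the figure quoted above
    for ratio in (0.50, 0.25):
        price = minimum_price(force_fee, ratio)
        print(f"Cost of Service {ratio:.0%}: charge at least ${price:.0f} "
              f"(gross margin {gross_margin(price, force_fee):.0%})")
```

At a 50% Cost of Service the floor is $100 a seat, and getting down to 25% takes $200 a seat, which is where the range above comes from.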

What’s this Enterprise Tools category?

Enterprise IT is used to having a rich ecosystem that fills in the gaps.  When you think about it, purchasing the software application is just a small piece of the overall organism that is created when that app goes into production.  There are many products bolstering and augmenting the application’s functionality.  Don’t like the reporting provided out of the box?  Plug in a Business Intelligence Tool.  Need to integrate the application with other applications without writing too much custom code?  There’s everything from ETL tools ala Informatica to shift data between tables to complex messaging systems from companies like Tibco.  Need help managing logon information and implementing single sign on (SSO)?  There’s LDAP, Active Directory, and a ton of other products out there. 

Almost all of that is gone with Cloud Computing.  As someone quipped, “It isn’t that the data is in THE cloud, it just isn’t in MY data center anymore.”  And in fact, THE cloud is really many clouds: one for each data center of each provider you’re doing business with.  Even more interesting, a lot of the Old School providers of this stuff have technology that isn’t real relevant to the Cloud Computing Era, and many of them have been bought so they can be milked.  Witness all the BI vendors that have been absorbed.  Their time of innovation is done.

That’s actually great news.  The SaaS Enterprise Tools category is the lowest true Green Field opportunity in this model.  Nobody owns it.  The Giants are mostly absent.  And there are even surprisingly few startups about.  Perhaps it just doesn’t seem sexy enough, but there are real problems here that need solving.  I had lunch the other day with Mike Hoskins of Pervasive.  Among many other areas, they do a good business out of software that pumps data out of Salesforce and into your local data center so you can apply your BI tools to it.  I’ve interviewed Ken Rudin of LucidEra for this blog.  They provide BI solutions in the SaaS model, largely based on data from Salesforce again.  Another great example is EMC’s recent acquisition of SaaS backup vendor Mozy.

These are good opportunities in this segment.  There are customers with real pain and minimal competition so far.  The Giants are ill-positioned to jump in because of the disruptive business model that is SaaS.  I would expect to see a lot more action here before it’s over, but there is a very interesting move that just took place that seems to have largely been ignored.  Workday, Duffield’s Peoplesoft Version Two, has just acquired SOA integration tool vendor Cape Clear.  I think this is really an interesting move.  Yes, I’m sure they needed to be able to easily integrate a lot of systems outside Workday to sell their application, but I wonder if there is more going on here?  For example, at some point, I expect to see fine grain network effects emerge from the topology of the clouds.  These will be a function of the need to shift data between applications to integrate them.  There’s a real speeds and feeds issue there that has to be addressed.  It will be advantageous to run your software in the same cloud as what it integrates with.   This will favor really big clouds like Amazon’s.  I could also see it triggering partnerships bolstered by high speed dedicated links between data centers.  One example is Joyent’s dedicated link to the Facebook data center, which gives them a real advantage hosting Facebook applets.

Is Workday trying to lock in a part of that future integration pie?  Not clear, but there sure isn’t much else beyond Cape Clear in the space right now and Workday’s application is the kind that wants to be the system of record nexus for everything else.  Dana Gardner discusses how increasingly, it is the Service and not the Software that drives acquisitions like this.  After the merger, you won’t be able to buy Cape Clear except as a Service (now dubbed “Integration as a Service”).  Given that it was a very high quality offering, Cape Clear gives Workday an interesting and valuable differentiator, if nothing else.  One of the big puzzles of SaaS is how to get the more complex domains installed much more cheaply than conventional Enterprise Software.  Integrating with a bunch of Legacy systems can make that really hard unless you have a toolset like Cape Clear to simplify the job.  To the extent the tool is bought to integrate other SaaS vendors, it can serve as valuable lead generation to go sell the primary Workday Suite into Enterprises that clearly have SaaS underway.  All in all, I would rate this as a canny and highly strategic move that Workday has made.

SaaS Enterprise and Desktop Applications

This brings us finally to the topmost slices of the layer cake, applications.  I include here both desktop and Enterprise applications, so it’s everything from spreadsheets and word processors in the cloud to Salesforce.com.  That’s a lot of ground to cover, and it has barely been penetrated.  There are numerous application categories for which there are not yet any SaaS offerings, and many of the offerings that are available are still in their early days yet.  Most of the application companies I talk to are seeing unbridled demand.  It seems likely that for early markets there are enough customers out there in the SaaS early adopter crowd that you can go pretty far just because your offering is SaaS, assuming it works, of course.

What’s Strategic and Who’s Being Left Out?

First, there is an overall megatrend at work here, and that is the move from proprietary to open.  Companies will over time be less and less inclined to run datacenters.  Giant Cloud Centers like Amazon Web Services will be the new black and the New Open for that world.  That Openness will drive throughout the stack in an expanding wavefront, because Open wants to connect to Open.  That makes All Things Open strategic in this Cloud Computing Era.

Second, let’s talk briefly about acquisition strategy.  If your goal is to acquire SaaS market share and scale, there isn’t much available.  Salesforce is the largest pure SaaS vendor and they’re still under a billion in annual revenues, although they’re closing in on it.  That means acquisitions at this stage in the market should be more focused on capturing Strategic Choke Points than cubic dollars.

Let’s review potential choke points:

- Hosting and Platforms:  Look at the 3Tera and Q-Layer offerings as a means of supercharging data centers into the Cloud Era.  There are probably other players I’ve missed, but these guys give a flavor.  Be aware that virtualization is all the rage.  I personally have met 2 different Entrepreneurs in Residence at major Silicon Valley VC’s in just the last month who are focused on virtualization.  There’s a lot of attention here, and we can’t forget VMware, nor the fact that the OS makers all want to build it into the OS.  The nice thing about something like a 3Tera is that it’s a lot more than just virtualization.  The real answer is to recast virtualization as a solution, and thereby move up the stack.  Simply Continuous, for example, offers a Disaster Recovery solution based on virtualization.  Those EIRs I mention are also interested in solutions more than generic virtualization.

- Systems Software:  Sun’s purchase of MySQL signalled that consolidation has begun here.  We’re going to see the clash of the Relational DB’s versus the new era SimpleDB-style systems.  I have to expect that all the action over at Amazon will flush others out of the woodwork some time this year, especially Microsoft and Google.  The former may be overly preoccupied with Yahoo and therefore delayed.  As for App Servers, look for Dark Horses specific to the new languages.  Someone who does something really great for Cloud Computing may have a leg up, but I’m not sure how long it will last.  If you want to hang out in this layer, be focused outside the limelight.  LucidEra took over an open source column store DB and focused it around SaaS BI needs.  That’s safely out of the line of fire between MySQL and the SimpleDB’s of the world.  In fact, there are likely more opportunities in the BI-specific space.  Certainly this was very late in maturation for conventional On-premises.  I wonder if someone will build a Teradata equivalent in the Cloud, for example?

- Languages:  This is as low in the stack as I’d want to be innovating unless I had a serious niche picked out.  The world seems to be clamoring for new languages at the moment, so maybe there’s a good shot here.  And so far, nobody is very far along at packaging any of the new languages so they’re easy to use for Cloud Computing.  Stay away from the crowded niche of proprietary “non-programmer” languages.  These are the Bungees, Cogheads, and the like.  They’re really more like dBase or Access in the Cloud than they are Languages in the Cloud.  If one of these players can really hijack a major language and get a big enough lead, it will be interesting.  It’s very hard though, with Open Source.  It levels the playing field unless you’re very careful about how you add value.

- Enterprise Tools:  Huge opportunity here.  There is no compelling generic BI offering for SaaS.  Workday just bought arguably the best SOA offering in Cape Clear.  Yet many application domains require these tools and a whole lot more.  If you are a startup looking to be acquired, think about what services your company could add to the Amazon umbrella.  What are the things that would spread like wildfire among the couple hundred thousand developers who have accounts on Amazon?  Build your solution so it scales well and takes advantage of Amazon’s pricing for communication within their cloud and you could go far.  One thing I think is glaringly apparent and needed, for example, is an OpenID service for Amazon.  There are many many others.  Deconstruct the current On-premises IT ecosystem and see what makes sense for SaaS.

-  Applications:  If you want a strategic choke point, you want to own a system of record.  They’re the blue chip properties in the Enterprise Suites.  That’s because everything else gets its data from some system of record or another.  Let’s not be totally focused on the past though.  The Cloud is ideal for collaboration.  How can you combine a system of record domain with serious collaboration to build a new killer category?  Worth thinking about.

I think I’ve provided a decent framework for thinking about the SaaS world in terms of where the action is, what makes sense for M&A, and where the opportunities may be.  If there’s one thing I’m certain of, it’s that we’re early days on Cloud Computing and there is a lot more opportunity out there than I’ve portrayed in this brief article.  There will also be a lot more change, and market segmentation could be viewed along many more dimensions than the one I’ve portrayed here. 

Food for thought.

Related Articles

Just noticed Cote refers to the folks at the bottom of my stack as the “Morlocks”.  Remember the nasty troglodytes from H.G. Wells’s The Time Machine?  I don’t think the Morlocks are all that likely to eat the “Blond People” who are apparently the SaaS applications, but stranger things have happened!

I just watched the Google App Engine announcement.  It places them at the language level, which is a big leap up the stack I’ve drawn in this post.  It really raises the stakes for those playing at the lower levels!  See my post for more.

Posted in Web 2.0, amazon, data center, grid, saas, strategy | 10 Comments »

Apple, MacWorld, User Experience, and the Multicore Crisis

Posted by smoothspan on January 16, 2008

Looking over the parachute drops of information from MacWorld, I was struck by some underlying themes.  I won’t bore you with a recitation of the huge amount of surface level activity: there are plenty of better, more firsthand places to get that.  But some of those firsthand sources brought to mind some patterns I’m familiar with.

First, the multicore crisis bit.  I’ve written about it before, but let me recap.  What is the multicore crisis?  It is a wave of change that is being unleashed by virtue of the fact that microprocessors have stopped getting faster every 18 months.  Instead of gaining a faster clock speed with free benefits for all at scarcely any effort, we get more cores.  That ain’t bad, but it takes considerable effort at the software end to take advantage of the additional cores.  For the most part, we are far from keeping up with the availability of those cores.  For emphasis, here is a graph of Intel clock speeds that vividly shows just how long the curve has been flattened out:

Clock Speed Timeline

We’ve had another year in 2007 while the curve remained flat.

What does this have to do with Apple and MacWorld?  Well, in a simple vein, it was the multicore crisis checking in that caused Mathew Ingram to write, “Hey, Steve–you broke the Internet.”  He was remarking about how Twitter was virtually unusable for hours.  Twitter has become somewhat of an unwilling canary in the coal mine: if something is hot and getting traffic, Twitter seems bound to go down.  Why?  Because it is a victim of the Multicore Crisis.  The system’s architecture isn’t scaling.  It may be a software problem, i.e. it is not designed to take advantage of enough cpu’s, or an infrastructure problem, i.e. it can only take advantage of the cpu’s Twitter has physically bought and installed in their data center.  These can both be overcome.  Software can be made to take advantage of lots more processors.  Services like those Amazon and others offer let you scale up to many more cpu’s on short notice without having to buy physical hardware.  Failure to provide for both these contingencies is succumbing to the Multicore Crisis.

Twitter was not unique.  Mathew’s blog was very slow to come up when I tried to access the article, having been Techmemed.  He mentions Fake Steve Jobs got creamed and couldn’t make CoverIt Live work (Zoli mentions CoveritLive was CoveritDead).  The Apple store was down at one point too.

Scoble tells a similar story:  Engadget was up but very slow, Qik’s macworld channel was up and down, Mogulus was slow to unreachable.  Live video was hard to come by.  TUAW was fairly unreachable.  There were a couple sites that passed muster including TechCrunch (bravo!) and MacRumorsLive.  TechCrunch hammers Twitter for being down.  Again.  If, as its pundits like to think, Twitter will play a significant role in reporting events, it needs to work all the time.  It is, after all, a communication channel.  Moreover, it’s a communication channel under constant scrutiny.

This brings me to a point I want to make about the Multicore Crisis and The Big Switch (what Nick Carr calls the trend to move to Cloud Computing).  These two megatrends are combining to change what the important core competencies are to succeed.  Once upon a time, it was enough just to be able to lash together all the myriad pieces needed to create a web application with a good user design.  You could count on Moore’s Law to make machines faster and your customer growth was slow enough that scalability could be comfortably pushed out into the future as a high quality problem to deal with if you succeeded.  That’s no longer the case.  The ability for new ideas to catch on has become viral on the web for a variety of reasons, not the least of which is that so many more people are on the web and they’re interconnected in so many more ways than simple e-mail, search, and web browsing.

There is another, more subtle manifestation of all this.  The new MacBook Air personifies this.  In the Multicore era: user experience is the new black for hardware.  Why?  Well, in the old days, everyone wanted to upgrade every two years.  For a while, I bought a new PC every year.  And it was worth it.  The new machines were significantly faster than the old.  In a world where the upgrade cycle is so short, you want to buy cheap hardware.  Result?  Dell wins big.  They’re the best at building their hardware cheap, so you can buy it more often, so you can get that speed.  Dell was driven by the Need for Speed, and the relative ease with which Moore’s Law delivered it.

Times have changed.  In an era when you probably won’t upgrade every two years, let alone every year, it makes sense to look at something other than speed.  I have an idea, how about looking at the User Experience?  Is the machine sexier?  Does it do cool things?  I love the Air’s ability to “borrow” a disk drive via WiFi from a nearby machine as well as its ability to handle iPhone-like gestures on its touch pad.  Combining Apple’s trademark radically uber-cool Industrial Design with genuine usability innovation is a winning formula.  If it gets you to buy a new machine when you otherwise would be happy to stand pat, they win.  The fact that so much of what one does on a computer is via the Internet combined with the rise of very effective virtualization software has radically lowered the barriers to PC/Windows users buying a Mac as well.  The latter is the Big Switch component.

That’s two significant changes brought on by the Multicore Crisis and The Big Switch.  What is your company doing to get ahead of these trends before some competitor uses them to ride right over your business?

Posted in Web 2.0, data center, multicore, saas, strategy, user interface | 2 Comments »

IBM Trying to Keep Up With the Cloud Jones’s

Posted by smoothspan on January 3, 2008

Can you tell that the whole cloud computing thing is ratcheting up a few notches in intensity?  I blame Amazon, who’ve rolled out a ton of initiatives and gotten lots of traction among startups.  But we ain’t seen nothin’ yet, friends.

Already there are signs that others are feeling like the train is leaving the station.  One of the more interesting is that IBM is bringing on CouchDB’s Damien Katz to work on the project full time.  It seems to me that IBM is making this move to ensure that they have an answer for Amazon’s SimpleDB in the form of CouchDB.  Thanks to Patrick Logan for pointing this out in his own blog post.

We’re going to see this pace continue to accelerate, and we’re going to see those who want to be players jockeying to make sure that they have all the elements in their Cloud Platform Suite.  It’s still too early to tell what the exact combination of ingredients for success will be, but so far it looks like Amazon is the head chef when we see others trying to emulate what they’ve done.

Meanwhile this is fantastic news for developers and startups that want to embrace these technologies.  The danger in things like Amazon’s Web Services is that they are so unique that you become utterly dependent on them.  The more others offer the same sort of services, the more competition can work its magic and make the whole scene more vibrant, cheaper, and innovative.

Viva les Cloud Computing!

Posted in amazon, data center, grid, platforms, saas, strategy | 1 Comment »

Amazon Raises the Cloud Platform Bar Again With DevPay

Posted by smoothspan on January 1, 2008

Wow, what an exciting time to be watching the Amazon Cloud Platform evolve.  We’re just beginning to think through the recent SimpleDB announcement when Amazon launches DevPay.  LucidEra CEO Ken Rudin says land grabs are all about a race to the top of the mountain to plant your flag there first.  It seems like Amazon has hired a helicopter in the quest to get there first.  Google, Yahoo, and others are barely talking about their cloud platforms and here is Amazon with new developments piling up on each other.  And unlike some of the developments announced by companies like Google, this stuff is ready to go.  They’re not just talking about it.

What’s DevPay all about, anyway?  Simply put, Amazon are providing a service to automate your billing.  If you use their web services to offer a service of your own, it gives you the ability to let Amazon deal with billing for you.  It’s based off the pricing model for the rest of the Amazon Web Services like EC2 and S3, but you can use any combination of one-time charges, recurring monthly charges, and metered Amazon Web Service usage. You have total flexibility to price your applications either higher or lower than your AWS usage.  In addition, they’re promising to put everything they know about how to do e-commerce (and who knows more than Amazon?) behind making the user experience great for your customers and you.
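
To make the three pricing components concrete, here is a rough sketch of the arithmetic.  To be clear, this is not the DevPay API: the PricePlan class, the invoice_cents function, and every price in it are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class PricePlan:
    """Hypothetical plan; all amounts are in cents to avoid float rounding."""
    one_time_cents: int           # sign-up charge, billed once
    monthly_cents: int            # recurring monthly charge
    metered_cents_per_unit: int   # your price per unit of metered usage


def invoice_cents(plan: PricePlan, units_used: int, first_month: bool) -> int:
    """Total for one month: recurring + metered, plus the one-time fee if new."""
    total = plan.monthly_cents + plan.metered_cents_per_unit * units_used
    if first_month:
        total += plan.one_time_cents
    return total


# Illustrative numbers only: $5 sign-up, $10 a month, $0.20 per metered unit.
plan = PricePlan(one_time_cents=500, monthly_cents=1000, metered_cents_per_unit=20)
print(invoice_cents(plan, units_used=30, first_month=True))
print(invoice_cents(plan, units_used=30, first_month=False))
```

Setting the metered price above or below what the underlying usage costs you is where the “higher or lower than your AWS usage” flexibility would come in.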

It’s not a tremendously big step forward, but it’s useful.  It’s another brick in the wall.  There are companies out there providing SaaS infrastructure for whom billing is a big piece of their offering, so obviously it is a problem that people care about having solved.  What are the pros and cons of this particular approach?

Let’s start with the pros.  If you are going to use Amazon Web Services anyway, DevPay makes the process dead simple for you to get paid for your service.  It’s ideal for microISV’s as a way to monetize their creations.  The potential is there for interesting revenue that’s tied to usage in the classic SaaS way.

What about the cons?  Here there are many, depending on what sort of business you are in and how you want to be perceived by customers.  I break it down into two major concerns: flexibility and branding.  Let’s start with branding, which I think is the more important concern.  It’s not clear to me from the announcement how you would go about disassociating your offering from Amazon so that it becomes your stand alone brand.  You and your customers are going to have to acknowledge and accept that the offering you provide is part of the Amazon collective.  Resistance is futile.  This is the moral equivalent of not being able to accept a credit card directly, and instead having to refer customers to PayPal.  It works, but it detracts from your “big time” image.  If having a big time stand-alone image is important for you, DevPay is a non-starter at this stage.  It’s not clear to me that Amazon would have to keep it that way for all time, but perhaps they need to protect their own image as well, and would insist on it.

Second major problem is flexibility.  Yes, Amazon says you can “use any combination of one-time charges, recurring monthly charges, and metered Amazon Web Service usage”.  That sounds flexible, but it casts your business in light of what resources it consumes on Amazon.  Suppose you want a completely different metric?  Perhaps you have another expense that is not well correlated with Amazon of some kind that has to be built in, for example.  Perhaps you need to do something completely arbitrary.  It doesn’t look to me like Amazon can facilitate that at the present.

Both of these limitations are things Amazon could choose to clean up.  So far, the impression one gets is that Amazon is just putting a pretty face on the considerable internal resources they’ve developed for their primary business and making them available.  What will be interesting is to see what happens when (and if) Amazon is prepared to add value in ways that never mattered to their core business.  Meanwhile, they’re doing a great job stealing a march on potential competition.  As a SaaS business, they should be quite sticky.  Anyone that writes for their platform will have a fair amount of work to back out and try another platform.  DevPay is another example.  It will create network lock-in by tying your customer’s business relationship in terms of billing and payment to Amazon, and in turn tying that to your use of Amazon Web Services.  For example, that same lack of flexibility might prevent you from migrating your S3 or EC2 usages to, say, Google.  There doesn’t look to be a way for you to build the Google costs into your billing in a flexible way.

We’ll see the next 5 to 10 years be a rich period of innovation and transition to Cloud Computing Platforms.  Just as many of the original PC OS platforms disappeared (CP/M anyone?) after an initial flurry of activity, and others have changed radically in importance (it no longer matters whether you run PC or Mac, does it?), so too will there be dramatic changes here.  The beneficiaries will be users as well as the platform vendors, but it’s going to take nimbleness and prescient thinking to place all your bets exactly right.  The good news is the cost of making a mistake is far less than it had been in the era of building your own datacenters!

Related Articles

To Rule the Clouds Takes Software: Why Amazon’s SimpleDB is a Huge Next Step

Coté’s Excellent Description of the Microsoft Web Rift

Posted in amazon, data center, ec2, grid, saas, strategy | 5 Comments »

Eventual Consistency Is Not That Scary

Posted by smoothspan on December 22, 2007

Amazon’s new SimpleDB offering, like many other post-modern databases such as CouchDB, offers massive scaling potential if users will accept eventual consistency.  It feels like a weighty decision.  Cast in the worst possible light, eventual consistency means the database will sometimes return the wrong answer in the interests of allowing it to keep scaling.  Gasp!  What good is a database that returns the wrong answer?  Why bother? 

Often waiting for the write answer (sorry, that inadvertent slip makes for a good pun so I’ll leave it in place) returns a different kind of wrong answer.  Specifically, it may not return an answer at all.  The system may simply appear to hang. 

How does all this come about?  Largely, it’s a function of how fast changes in the database can be propagated to the point they’re available to everyone reading from the database.  For small numbers of users (i.e. we’re not scaling at all), this is easy.  There is one copy of the data sitting in a table structure, we lock up the readers so they can’t access it whenever we change that data, and everyone always gets the right answer.  Of course, solving simple problems is always easy.  It’s solving the hard problems that lands us the big bucks.  So how do we scale that out?  When we reach a point where we are delivering that information from that one single place as fast as it can be delivered, we have no choice but to make more places to deliver from.  There are many different mechanisms for replicating the data and making it all look like one big happy (but sometimes inconsistent) database.  Let’s look at them.

Once again, this problem may be simpler when cast in a certain way.  The most common and easiest approach is to keep one single structure as the source of truth for writing, and then replicate out changes to many other databases for reading.  All the common database software supports this.  If your single database could handle 100 users consistently, you can imagine that if those 100 users were each another database you were replicating to, suddenly you could handle 100 * 100 users, or 10,000 users.  Now we’re scaling.  There are schemes to replicate the replicated and so on and so forth.  Note that in this scenario, all writing must still be done on the one single database.  This is okay, because for many problems, perhaps even the majority, readers far outnumber writers.  In fact, this works so well that we may not even use databases for the replication.  Instead, we might consider a vast in-memory cache.  Software such as memcached does this for us quite nicely, with another order of magnitude performance boost since reading things in memory is dramatically faster than trying to read from disk.

Okay, that’s pretty cool, but is it consistent?  This will depend on how fast you can replicate the data.  If you can get every database and cache in the system up to date between consecutive read requests, you are sure to be consistent.  In fact, it just has to get done between read requests for any piece of data that changed, which is a much lower bar to hurdle.  If consistency is critical, the system may be designed to inhibit reading until changes have propagated.  It takes some very clever algorithms to do this well without throwing a spanner into the works and bringing the system to its knees performance-wise. 

Still, we can get pretty far.  Suppose your database can service 100 users with reads and writes and keep it all consistent with appropriate performance.  Let’s say we replace those 100 users with 100 copies of your database to get up to 10,000 users.  It’s now going to take twice as long.  During the first half, we’re copying changes from the Mother Server to all of the children.  The second half we’re serving the answers to the readers requesting them.  Let’s say we can keep the overall time the same just by halving how many are served.  So the Mother Server talks to 50 children.  Now we can scale to 50 * 50 = 2500 users.  Not nearly as good, but still much better than not scaling at all.  We can go 3 layers deep and have Mother serve 33 children, each of which serves 33 grandchildren, to get to 33 * 33 * 33 = 35,937 users.  Not bad, but Google’s founders can still sleep soundly at night.  The reality is we probably can handle a lot more than 100 on our Mother Server.  Perhaps she’s good for 1000.  Now the 3-layered scheme will get us all the way to 333*333*333 = 36 million.  That starts to wake up the sound sleepers, or perhaps makes them restless.  Yet, that also means we’re using over 100,000 servers too: 1 Mother talks to 333 children who each have 333 grandchildren.  It’s a pretty wasteful scheme.
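
If you want to play with these numbers yourself, here is a small sketch of the arithmetic.  It assumes a perfectly even tree in which every server feeds the same number of children (or, on the last layer, users); the function name is just something I made up for the illustration.

```python
def replication_tree(fanout: int, layers: int) -> tuple[int, int]:
    """Return (users_served, servers_needed) for a replication tree.

    `layers` counts layers of servers: 1 is the Mother Server alone,
    2 adds her children, 3 adds grandchildren.  Every server feeds
    `fanout` children, and each server in the last layer serves
    `fanout` users.
    """
    servers = sum(fanout ** level for level in range(layers))
    users = fanout ** layers
    return users, servers


for fanout, layers in [(100, 2), (50, 2), (33, 3), (333, 3)]:
    users, servers = replication_tree(fanout, layers)
    print(f"fanout {fanout:>3}, {layers} layers: {users:>10,} users on {servers:>7,} servers")
```

The last case is the one that makes the waste obvious: the server count grows almost as fast as the user count does.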

Well, let’s bring in Eventual Consistency to reduce the waste.  Assume you are a startup CEO.  You are having a great day, because you are reading the wonderful review of your service in TechCrunch.  It seems like the IPO will be just around the corner after all that gushing does its inevitable work and millions suddenly find their way to your site.  Just at the peak of your bliss, the CTO walks in and says she has good news and bad news.  The bad news is the site is crashing and angry emails are pouring in.  The other bad news is that to fix it “right”, so that the data stays consistent, she needs your immediate approval to purchase 999 servers so she can set up a replicated scheme that runs 1 Mother Server (which you already own) and 999 children.  No way, you say.  What’s the good news?  With a sly smile, she tells you that if you’re willing to tolerate a little eventual consistency, your site could get by on a lot fewer servers than 999.

Suppose you are willing to have it take twice as long as normal for data to be up to date.  The readers will read just as fast, it’s just that if they’re reading something that changed, it won’t be correct until the second consecutive read or page refresh.  So, our old model that had the system able to handle 1,000 users, and replicated to 999 servers to handle 1 million users used to have to go to 3 tiers (333 * 333 * 333) to get to the next level at 36 million and still serve everything consistently and just as fast.  If we relax the “just as fast”, we can let our Mother Server handle 2,000 at half the speed to get to 2000 * 1000 = 2 million users on 3 tiers with 2000 servers instead of 100,000 servers to get to 36 million.  If we run 4x slower on writes, we can get 4000*1000 = 4 million users with 4000 servers.  Eventually things will bog down and thrash, but you can see how tolerating Eventual Consistency can radically reduce your machine requirements in this simple architecture.  BTW, we all run into Eventual Consistency all the time on the web, whether or not we know it.  I use Google Reader to read blogs and WordPress to write this blog.  Any time a page refresh shows you a different result when you didn’t change anything, you may be looking at Eventual Consistency.  Even if you suspect others changed something, Google Reader still comes along frequently and says an error occurred and asks me to refresh.  It’s telling me they relied on Eventual Consistency and I have an inconsistent result.
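
Here is a toy model of that “wrong until the next refresh” behavior.  It is only a sketch under simple assumptions: everything lives in one process, writes go to a single Mother Server, and replication happens only when replicate() is called.  The Primary class and its methods are hypothetical names, not any real database’s API.

```python
class Primary:
    """Mother Server that accepts writes and copies them to replicas later."""

    def __init__(self, replica_count: int):
        self.data = {}
        self.replicas = [dict() for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the replicas

    def write(self, key, value):
        # Writes land on the Mother Server immediately.
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Push every pending write to every replica, oldest first.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

    def read(self, key, replica_index: int):
        # Reads are served by replicas, so they can lag behind writes.
        return self.replicas[replica_index].get(key)


db = Primary(replica_count=2)
db.write("in_stock", 10)
db.replicate()
db.write("in_stock", 9)          # a sale happens
first = db.read("in_stock", 0)   # stale: replication has not run yet
db.replicate()
second = db.read("in_stock", 0)  # consistent again after the "refresh"
print(first, second)
```

The longer you are willing to wait between calls to replicate(), the less copying work the system does per read, which is the trade the CTO is offering.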

As I mention, these approaches can still be wasteful of servers because of all the data copies that are flowing around.  This leads us to wonder, “What’s the next alternative?”  Instead of just using servers to copy data to other servers, which is a prime source of the waste, we could try to employ what’s called a sharded or Federated architecture.  In this approach, there is only one copy of each piece of data, but we’re dividing up that data so that each server is only responsible for a small subset of it.  Let’s say we have a database keeping up with our inventory for a big shopping site.  It’s really important to have it be consistent so that when people buy, they know the item was in stock.  Hey, it’s a contrived example and we know we can cheat on it, but go with it.  Let’s further suppose we have 100,000 SKU’s, or different kinds of items in our inventory.  We can divide this across 100 servers by letting each server be responsible for 1,000 items.  Then we write some code that acts as the go-between with the servers.  It simply checks the query to see what you are looking for, and sends your query to the correct sub-server.  Voila, you have a sharded architecture that scales very efficiently.  Our replicated model would blow out 99 copies from the 1 server, and it could be about 50 times faster (or handle 50x the users as I use a gross 1/2 time estimate for replication time) on reads, but it was no faster at all on writes.  That wouldn’t work for our inventory problem because writes are so common during the Christmas shopping season. 
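
The go-between code can be very small.  Below is a minimal sketch for the inventory example, assuming SKU’s are integers from 0 to 99,999 and that each shard owns a contiguous block of 1,000 of them; in-memory dicts stand in for the 100 database servers, and all the names are mine.

```python
SHARD_COUNT = 100
SKUS_PER_SHARD = 1_000


def shard_for(sku: int) -> int:
    """Pick the one shard responsible for this SKU (contiguous ranges)."""
    if not 0 <= sku < SHARD_COUNT * SKUS_PER_SHARD:
        raise ValueError(f"unknown SKU {sku}")
    return sku // SKUS_PER_SHARD


class ShardedInventory:
    """Go-between that forwards each request to the correct shard."""

    def __init__(self):
        # One dict per shard stands in for one database server.
        self.shards = [dict() for _ in range(SHARD_COUNT)]

    def set_stock(self, sku: int, quantity: int) -> None:
        self.shards[shard_for(sku)][sku] = quantity

    def stock(self, sku: int) -> int:
        return self.shards[shard_for(sku)].get(sku, 0)


inventory = ShardedInventory()
inventory.set_stock(42_317, 5)
print(shard_for(42_317), inventory.stock(42_317))
```

Because both reads and writes for a SKU go to exactly one server, writes scale along with reads, which is what the replicated model could not give us.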

Now, what are the pitfalls of sharding?  First, there is some assembly required.  Actually, there is a lot of assembly required.  It’s complicated to build such architectures.  Second, it may be very hard to load balance the shards.  Just dividing up the product inventory across 100 servers is not necessarily helpful.  You would want to use knowledge of access patterns to divide the products so the load on each server is about the same.  If all the popular products wound up on one server, you’d have a scaling disaster.  These balances can change over time and have to be updated, which brings more complexity.  Some say you never stop fiddling with the tuning of a sharded architecture, but at least we don’t have Eventual Consistency.  Hmmm, or do we?  If you can ever get into a situation where there is more than one copy of the data and the one you are accessing is not up to date, Eventual Consistency could rear up as a design choice made by the DB owners.  In that case, they just give you the wrong answer and move on.
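As a rough illustration of dividing products by access pattern rather than by count, the sketch below uses a simple greedy heuristic: place the hottest products first, each on whichever shard currently carries the least load.  The product names and load figures are made up for the example, and the heuristic is one common choice, not something the discussion above prescribes.

```python
import heapq


def balance(product_load, shard_count):
    """Assign products to shards so the summed load is roughly even.

    product_load maps a product id to its expected load (hypothetical units).
    Returns (assignment, totals): product -> shard, and load per shard.
    """
    heap = [(0, shard) for shard in range(shard_count)]   # (load, shard id)
    heapq.heapify(heap)
    assignment = {}
    hottest_first = sorted(product_load.items(), key=lambda kv: kv[1], reverse=True)
    for product, load in hottest_first:
        shard_load, shard = heapq.heappop(heap)           # least-loaded shard
        assignment[product] = shard
        heapq.heappush(heap, (shard_load + load, shard))
    totals = [0] * shard_count
    for product, shard in assignment.items():
        totals[shard] += product_load[product]
    return assignment, totals


# Made-up access figures: one very popular item and several slow sellers.
loads = {"console": 900, "tv": 500, "socks": 300, "mug": 200, "pen": 100}
assignment, totals = balance(loads, 2)
print(assignment, totals)
```

Splitting these five products by count alone could put the console and the TV on the same server.  Access patterns drift, so a real system would have to rerun something like this and move data, which is where the extra complexity comes from.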

How can this happen in the sharded world?  It’s all about that load balancing.  Suppose our load balancer needs to move some data to a different shard.  Suppose the startup just bought 10 more servers and wants to create 10 additional shards.  While that data is in motion, there are still users on the site.  What do we tell them?  Sometimes companies can shut down the service to keep everything consistent while changes are made.  Certainly that is one answer, but it may annoy your users greatly.  Another answer is to tolerate Eventual Consistency while things are in motion with a promise of a return to full consistency when the shards are done rebalancing.  Here is a case where the Eventual Consistency didn’t last all that long, so maybe that’s better than the case where it happens a lot.
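Here is a toy sketch of that window of inconsistency for a single record that is being moved.  It assumes, purely for illustration, that writes go to the record’s new shard while readers are still routed to the old one until the move is declared finished.  Real systems make different choices here; the names are hypothetical.

```python
class MovingRecord:
    """One inventory record while it is copied from an old shard to a new one."""

    def __init__(self, stock):
        self.old_copy = stock     # copy on the old shard, where reads still go
        self.new_copy = stock     # copy already placed on the new shard
        self.cut_over = False     # becomes True when rebalancing finishes

    def write(self, stock):
        # Assumption: writes land on the record's new home right away.
        self.new_copy = stock

    def read(self):
        # Until cutover, a read may return the stale old copy.
        return self.new_copy if self.cut_over else self.old_copy

    def finish_move(self):
        # Rebalancing is done: retire the old copy and route reads to the new.
        self.old_copy = self.new_copy
        self.cut_over = True


record = MovingRecord(stock=5)
record.write(4)              # a purchase arrives while the data is in motion
stale = record.read()        # the old shard still answers with the old value
record.finish_move()
fresh = record.read()        # consistency returns once the move completes
print(stale, fresh)
```

The stale answer only exists between the write and the cutover, which matches the point that this kind of Eventual Consistency is short-lived.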

Note that consistency is often in the eye of the beholder.  If we’re talking Internet users, ask yourself how much harm there would be if a page refresh delivered a different result.  In many applications, the user may even expect or welcome a different result.  An email program that suddenly shows mail after a refresh is not at all unexpected.  That the user didn’t know the mail was already on the server at the time of the first refresh doesn’t really hurt them.  There are cases where absolute consistency is very important.  Go back to the sharded database example.  It is normal to expect every single product in the inventory to have a unique id that lets us find that part.  Those ids have to be unique and consistent across all of the shards.  It is crucially important that any id changes are up to date before anything else is done or the system can get really corrupted.  So, we may create a mechanism to generate consistent ids across shards.  This adds still more architectural complexity.
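One possible mechanism for ids, sketched below, is to give each shard its own arithmetic sequence: shard i out of N hands out i, i + N, i + 2N, and so on.  No shard has to ask another before issuing an id, yet two shards can never issue the same one.  This is an assumption about how such a mechanism might look, not a description of any particular product, and the class name is hypothetical.

```python
import itertools


class ShardIdGenerator:
    """Hands out ids that cannot collide with ids from any other shard."""

    def __init__(self, shard_index, shard_count):
        if not 0 <= shard_index < shard_count:
            raise ValueError("shard_index must be in [0, shard_count)")
        # Shard i issues i, i + N, i + 2N, ... where N is the shard count.
        self._ids = itertools.count(start=shard_index, step=shard_count)

    def next_id(self):
        return next(self._ids)


generators = [ShardIdGenerator(i, 3) for i in range(3)]
# Interleave requests across the three shards, four rounds in total.
issued = [g.next_id() for _ in range(4) for g in generators]
print(issued)
```

The catch is that the shard count is baked into every id sequence, so adding shards later needs a plan of its own.  That is a small example of the extra architectural complexity.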

There are nightmare scenarios where it becomes impossible to shard efficiently.  I will oversimplify to make it easy and not necessarily correct, but I hope you will get the idea.  Suppose you’re dealing with operations that affect many different objects.  The objects are divided into shards naturally when examined individually, but the operations between the objects span many shards.  Perhaps the relationships between the objects are so tangled that there is no way to divide them across machines without nearly every operation hitting many shards instead of a single shard.  Hitting many shards will invalidate the sharding approach.  In times like this, we will again be tempted to opt for Eventual Consistency.  We’ll get to hitting all the shards in our sweet time, and any accesses before that update is finished will just live with inconsistent results.  Such scenarios can arise where there is no obvious good sharding algorithm, or where the relationships between the objects (perhaps it’s some sort of real time collaborative application where people are bouncing around touching objects unpredictably) are changing much too quickly to rebalance the shards.  One really common case of an operation hitting many shards is queries.  You can’t anticipate all queries such that any of them can be processed within a single shard unless you sharply limit the expressiveness of the query tools and languages.
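The query case is easy to see in a small sketch.  The inventory is sharded by SKU, but the question below is about stock levels, so no single shard can answer it and every shard has to be visited.  The data and the function name are invented for the example.

```python
def low_stock(shards, threshold):
    """Return every SKU whose stock is below threshold, plus shards visited.

    The shards are keyed by SKU range, which says nothing about stock level,
    so the query has to fan out to all of them and merge the answers.
    """
    hits = []
    shards_touched = 0
    for shard in shards:
        shards_touched += 1
        hits.extend(sku for sku, qty in shard.items() if qty < threshold)
    return sorted(hits), shards_touched


# Three tiny stand-in shards mapping SKU -> quantity in stock.
shards = [
    {1: 10, 2: 0},
    {1001: 3, 1002: 50},
    {2001: 7, 2002: 1},
]
skus, touched = low_stock(shards, 5)
print(skus, touched)
```

A lookup by SKU touches one shard; this query touches all of them, and the cost grows with the number of shards.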

I hope you come away from this discussion with some new insights:

-  Inconsistency derives from having multiple copies of the data that are not all in sync.

-  We need multiple copies to scale.  This is easiest for reads.  Scaling writes is much harder.

-  We can keep copies consistent at the expense of slowing everything down to wait for consistency.  The savings in relaxing this can be quite large.

-  We can somewhat balance that expense with increasingly complex architecture.  Sharding is more efficient than replication, but gets very complex and can still break down, for example. 

-  It’s still cheaper to allow for Eventual Consistency, and in many applications, the user experience is just as good.

Big web sites realized all this long ago.  That’s why sites like Amazon have systems like SimpleDB and Dynamo that are built from the ground up with Eventual Consistency in mind.  You need to look very carefully at your application to know what’s good or bad, and also understand what the performance envelope is for the Eventual Consistency.  Here are some thoughts from the blogosphere:

Dare Obasanjo

The documentation for the PutAttributes method has the following note

Because Amazon SimpleDB makes multiple copies of your data and uses an eventual consistency update model, an immediate GetAttributes or Query request (read) immediately after a DeleteAttributes or PutAttributes request (write) might not return the updated data.

This may or may not be a problem depending on your application. It may be OK for a del.icio.us style application if it took a few minutes before your tag updates were applied to a bookmark but the same can’t be said for an application like Twitter. What would be useful for developers would be if Amazon gave some more information around the delayed propagation such as average latency during peak and off-peak hours.

Here I think Dare’s example of Twitter suffering from Eventual Consistency is interesting.  In Twitter, we follow micro-blog postings.  What would be the impact of Eventual Consistency?  Of course it depends on the exact nature of the consistency, but let’s look at our replicated reader approach.  Recall that in the Eventual Consistency version, we simply tolerate that we allow reads to come in so fast that some of the replicated read servers are not up to date.  However, they are up to date with respect to a certain point in time, just not necessarily the present.  In other words, I could read at 10:00 am and get results on one server that are up to date through 10:00 am and on another results only up to date through 9:59 am.  For Twitter, depending on which server my session is connected to, my feeds may update a little behind the times.  Is that the end of the world?  For Twitter users, if they are engaged in a real time conversation, it means the person with the delayed feed may write something that looks out of sequence to the person with the up to date feed whenever the two are in a back and forth chat.  OTOH, if Twitter degraded to that mode rather than taking longer and longer to accept input or do updates, wouldn’t that be better?
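The 10:00 am versus 9:59 am situation can be sketched with two read replicas that have applied the same write log up to different points in time.  The posts, the authors, and the class name are all invented for illustration, and times are plain minutes since midnight to keep the sketch short.

```python
# The single write log: (minute of day, author, text).  600 is 10:00 am.
POSTS = [
    (598, "alice", "Anyone up for lunch?"),
    (599, "bob", "Sure, where?"),
    (600, "alice", "Taqueria at noon."),
]


class ReadReplica:
    """A read server that has applied the write log up to some minute."""

    def __init__(self, applied_through):
        self.applied_through = applied_through

    def timeline(self):
        # Everything the replica returns is correct as of applied_through;
        # it is simply missing whatever was written after that point.
        return [text for minute, _, text in POSTS if minute <= self.applied_through]


current = ReadReplica(applied_through=600)   # up to date through 10:00 am
lagging = ReadReplica(applied_through=599)   # only through 9:59 am
print(current.timeline())
print(lagging.timeline())
```

A user whose session lands on the lagging replica has not seen the latest post yet, so a reply typed at that moment can look out of sequence to someone reading from the current replica.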

Erik Onnen

Onnen wrote a post called “Socializing Eventual Consistency” that has two important points.  First, many developers are not used to talking about Eventual Consistency.  The knee jerk reaction is that it’s bad, not the right thing, or an unnecessary compromise for anyone but a huge player like Amazon.  It’s almost like a macho thing.  Onnen lacked the right examples and vocabulary to engage his peers when it was time to decide about it.  Hopefully all the chatter about Amazon’s SimpleDB and other massively scalable sites will get more familiarity flowing around these concepts.  I hope this article also makes it easier.

His other point is that when push comes to shove, most business users will prefer availability over consistency.  I think that is a key point.  It’s also a big takeaway from the next blog:

Werner Vogels

Amazon’s CTO posted to try to make Eventual Consistency and its trade-offs more clear for all.  He lays a lot of good theoretical groundwork that boils down to explaining that there are trade-offs and you can’t have it all.  This is similar to the message I’ve tried to portray above.  Eventually, you have to keep multiple copies of the data to scale.  Once that happens, it becomes harder and harder to maintain consistency and still scale.  Vogels provides a full taxonomy of concepts (e.g. Monotonic Write Consistency et al) with which to think about all this and evaluate the trade-offs.  He also does a good job pointing out how often even conventional RDBMSs wind up dealing with inconsistency.  Some of the best (and least obvious to many) examples include the idea that your mechanism for backups is often not fully consistent.  The right answer for many systems is to require that writes always work, but that reads are only eventually consistent.

Conclusion

I’ve covered a lot of consistency-related trade-offs involved in database systems for large web architectures.  Rest assured that unless you are pretty unsuccessful, you will have to deal with this stuff.  Get ahead of the curve and understand for your application what the consistency requirements will be.  Do not start out being unnecessarily consistent.  That’s a premature optimization that can bite you in many ways.  Relaxing consistency as much as possible while still delivering a good user experience can lead to radically better scaling as well as making your life simpler.  Eventual Consistency is nothing to be afraid of.  Rather, it’s a key concept and tactic to be aware of.

Personally, I would seriously look into solutions like Amazon’s SimpleDB while I was at it.

Posted in amazon, data center, enterprise software, grid, platforms, soa, software development

Green DC for Datacenters, Why Not Homes Too?

Posted by smoothspan on December 19, 2007

There are some posts up about how much greener DC power is for datacenters.  Phil Windley says a single server may have a larger carbon footprint than a 15 mpg SUV.  That was an amazing statistic for me to hear.  CNet reports there is a startup called Validus that is working on the idea of supplying DC within datacenters instead of AC to help combat the waste.

The idea is that converting AC to DC is wasteful of energy, particularly when it is done over and over again inside every digital box.  Whether we’re talking wall warts for small boxes or full-on PC power supplies (or worse) for bigger boxes, it is wasteful.  I don’t find that hard to believe.  All these power supplies get warm to the touch.  Heck, my PC supply has a big fan built right into it.  This leads me to ask:  If this is a good idea for datacenters, why isn’t it a good idea for homes too?

I wonder what percentage of the power we use in our homes would be better off DC rather than AC?  Most electronics immediately want to convert to DC.  All of our computers and home entertainment electronics are in that category.  Many things with motors are not currently in that camp, but they could be with DC motors.  I’m speaking of refrigerators, for example.  It’s not clear to me that something like an electric range or water heater really cares one way or the other.  Certainly anyone that has built a PC can see that it would be easy enough to give up the power supply.  The principal issue is simply providing cabling with the right voltages.  It’s the same cabling that comes out of your power supply today.  One issue would involve those voltages.  It’s very easy to change voltages in the AC regime with a simple transformer.  DC is more problematic.  Unfortunately, even PCs like several different voltages to be available.  A typical PC power supply offers -12V, -5V, 0V (ground), 3.3V, 5V, and 12V.  Phew, that would be a lot of outlets on the wall!  I wonder what the chances are we could get by on fewer voltage options.

The other interesting thing is that DC is ideally suited for batteries and solar power.  As it stands, those who wish to employ solar power have to run the DC output from their solar panels into an inverter to convert it to AC.  Of course this costs energy through losses to make the transformation.  If we had more of a DC-based economy, life would be better.

So why doesn’t the power company just provide DC instead of AC?  There was an epic battle fought between Thomas Edison, a DC proponent, and Nikola Tesla, the father of many early AC inventions, that Edison ultimately lost.  It turns out to be cheaper to generate AC with rotating machinery, and cheaper to transmit it over long distances.  If we’re going to go DC, we need to do so at the “last mile”.  It remains an intriguing idea.  I’ll be on the lookout for an “appliance” power supply that will run multiple computers and save power.

Posted in data center

What if Twitter Was Built on Amazon’s Cloud?

Posted by smoothspan on December 18, 2007

There was recent bellyaching in the blogosphere again about Twitter being down.  Dave Winer grumbles, “What other basic form of communication goes down for 12 hours at a time?”  There are various comments, and in the end, apparently it was about their moving ISPs.  Twitter themselves had this to say:

Twitter is humming along now after a late night. Our team worked earnestly into the night and morning on our largest and most complex maintenance project ever. Everything went pretty much according to plan except for one thing: an incorrect switch.

The switch in question caps traffic an unacceptable level. In order to correct this, we’ll need to get some hardware installed. Unfortunately, that means we’re not done with our datacenter move just yet. This type of work can be frustrating but it’s all towards Twitter’s highest goal: reliability.

Such moves are never easy, they always include a hitch of some kind, and the Twitter customer base is hopelessly addicted to the medium so Twitter hears about it whenever they turn the thing off for any period of time.  I look at this and for me it’s just one more reason I wouldn’t want to own a datacenter.

Suppose your service, or maybe even Twitter, was built on Amazon’s Cloud or some other Utility Computing solution.  You don’t own the servers, you are renting them.  If loads go up, you can simply rent more in direct proportion to the loads and on 10 minutes notice.  A recent High Scalability article on scaling Twitter shows they don’t really have all that many servers:

  • 1 MySQL Server (one big 8 core box) and 1 slave. Slave is read only for statistics and reporting.
  • 8 Sun X4100s.

10 boxes, in other words.  Now it comes time to upgrade.  Much pain and frustration.  To do it well, and without interruption, they really need 2 complete copies of their infrastructure.  This way, they can prepare the new version and start cutting users over to it while leaving the old one running.  When everyone is over, the old system can be decommissioned.  For many startups, owning twice as much hardware as they use is just out of the question.  The more successful they become, the more expensive it becomes to entertain such a luxury.  Not so on a utility computing service like Amazon’s.  Purchase the use of twice as many servers for just as long as it takes for a successful upgrade and then cut them loose afterward.

There are detractors to the Amazon approach out there, but do we really think it would make Twitter much less reliable?  What if it made it much more reliable?

Here’s another thought that runs rampant:  how well would Amazon’s new SimpleDB work for a service like Twitter?  It seems tailor-made.  Certainly the notion of a “texty” database with up to 1024 characters per field seems like a fit.  It would be fascinating to see some of the Twitterati put up a Twitter clone on Amazon’s Web Services using SimpleDB just to see how well it works and how quickly it could be put together.  Given the platform and the requirements of the application, it seems like it would not be that hard to do the experiment.  It would certainly make for an interesting test of how well Amazon’s infrastructure really works.

Posted in Web 2.0, data center, ec2, grid, platforms