SmoothSpan Blog

For Executives, Entrepreneurs, and other Digerati who need to know about SaaS and Web 2.0.

Archive for the ‘multicore’ Category

Salesforce Switches to Dell/Linux. What’s Next, MySQL Over Oracle?

Posted by smoothspan on July 15, 2008

Salesforce will be unplugging the last of their Sun Solaris servers from their SaaS operations this week, according to TechCrunchIT.  That’s quite a big change for Salesforce, and a bit of a PR blow for Sun.  It reflects some important operational realities that the rest of the industry and corporate IT should be watching carefully.

First, vertical scaling is hard in the multicore crisis era.  When CPUs no longer get twice as fast with every Moore cycle, scaling is harder to come by and hardware gets commoditized.  Future scaling has to come from software architecture changes: horizontal scaling, in other words, not vertical scaling.  The multicore crisis brings us to an era of many small computers rather than fewer, more powerful computers, and it’s up to the software guys to figure that out.

Second, for a SaaS company, the cost of service delivery is an absolutely critical factor.  Once you have software that runs well and scales horizontally on cheap commodity hardware, you’ve created a huge cost advantage for yourself.  As we speak, the cost to deliver service for the various public SaaS companies is all over the map, but Salesforce has always had one of the lowest if not the lowest cost on the map.  This allows them to either show greater profitability or reinvest the savings in faster growth.

This brings me to my other point.  How long can it be before they investigate swapping out Oracle for MySQL?  As the TechCrunchIT article mentions, Salesforce started with Oracle but there’s been no mention recently about the current status.  It would be a logical further development in reducing costs if they had chosen to eliminate or were working on eliminating the cost of Oracle licenses.  For many SaaS vendors, this is a huge piece of their Cost of Services. 

Can you build industrial grade software without Oracle?  In a word, yes.  Many highly scalable web sites have done so and lived to tell the tale.  It’s more work, but once you’ve done the work the payoff can mean huge savings.  At a prior employer we were actually quite surprised to test several Open Source DB’s and learn their performance was actually not that far off of Oracle’s.  My current employer, Helpstream, has built everything on an Open Source stack and the benefits have been enormous.

How long will it be before we’re hearing that Salesforce has dropped Oracle too?  What’s your company doing to leverage commodity hardware and Open Source databases?

Related Articles

Fellow Enterprise Irregulars Dennis Howlett, Vinnie Mirchandani, and Thomas Foydel (who first raised the point about brand comfort) make some excellent points on this subject, particularly the issue of switching off Oracle.

One key point that I have personally heard before is that SaaS vendors like to offer customers the comfort of knowing the solution runs on Oracle versus Open Source.  It’s a more conservative stance that plays well in the Enterprise.  Brand matters.  Sun is working hard on the MySQL brand, but they certainly haven’t caught Oracle yet.  As Vinnie puts it, the question is, “whether SaaS vendors benefit from at least the perception that Oracle is more “bullet proof” or do SaaS customers just want results (high uptime, performance etc) and don’t really care what the underlying  technology is – especially if the economics are more attractive?”

Dennis adds some other unique thinking.  If Salesforce wants to be acquired by Oracle then it should stick with the Oracle stack.  The only thing I’d add there is it’s pretty easy to switch from MySQL back to Oracle and much harder to do the reverse.  I think they’d be fine in an acquisition if they were simply careful not to emphasize a switch to MySQL.  They did work reasonably hard to keep the Dell switch quiet, so they may already be on that path.

The second thought Dennis has is that licensing is very complex on these things.  Here again I have to agree.  Just dealing with the legals and other aspects of a relationship with a large Enterprise vendor/technology partner is expensive for a startup.

In both cases, Vinnie’s post title fits:  Does SaaS need Oracle more than Oracle needs SaaS?

Good insights, guys!

Posted in multicore, platforms, saas

Are Custom Chips An Answer to the Multicore Crisis?

Posted by smoothspan on March 20, 2008

Stacey Higginbotham wrote an interesting piece that made me wonder.  Apparently there are lots of less-than-cutting-edge chip fabs out there that people want to keep running.  In an age where smaller features on chips translate only to more transistors, but not necessarily faster transistors, is the ability to have more transistors as economically valuable?  Particularly if we can’t put them to use?  The problem is that Moore’s Law these days translates to more cores, not faster clock speeds.  Nearly all the software out there can’t make use of the extra cores yet, and there is a lot of discussion about how the world may have to completely retool software to make use of lots of cores.  Meanwhile, Intel sails on with 6 core chips on the horizon, more to come, and not a very good idea what to do with them.

What if what separates the latest “good” fabs from older “obsolete” fabs is no longer that valuable?  Maybe the value is less from ever smaller feature sizes and more from new chip types?  That would shift the economics from favoring giants like Intel capable of building ever more expensive fabs to those with the IP to design more new chips on fab processes that are “good enough” for lots of interesting applications.  As big as Intel is, maybe driving faster CPUs is not the most lucrative pastime at this point in the technology curve.

It used to be that a general purpose CPU that constantly doubled in speed every 18 months (Moore’s Law) was the place to invest.  It was a truly general purpose device capable of running all sorts of software.  There have been special purpose chips built too for maximum performance in various areas:  graphics coprocessors and various network chips are good examples.  Suppose we can make special purpose chips for almost any purpose.  I once talked to a startup that had built a hardware search accelerator for example.  They vanished into the Government spy world never to be heard from again, but it is intriguing. 

If you could create a dirt cheap special purpose chip, what would it do?  What would be the market for it?

Before the dawn of RISC there was much interest in hardware accelerators for specific languages.  Lisp machines were one such.  I remember reading a quote from Alan Kay that modern machines don’t run dynamic languages like Lisp and Smalltalk as much faster than the old machines like the Dorado as their newfound clockspeeds would imply they should.  He hinted that radically different hardware architectures could greatly benefit such languages.  I couldn’t find more on that than this quote that says the new generation is only about 50x faster than the old machines, which is a pretty poor showing indeed.

Today we have a renaissance in the interpreted and scripting languages that are descendants of languages like Lisp and Smalltalk.  Languages like Ruby (popularized by Rails), Python, and PHP are very mainstream and might benefit.  One wonders whether even Java might benefit.  Would a chip optimized to run the virtual machines of one of these languages without regard to compatibility with the old x86 world be able to run them a lot faster?  Would a chip that runs Java 10x faster than the fastest available cores from Intel be valuable at a time when Java has stopped getting faster via Moore’s Law?

It seems to me it would.

Posted in multicore

Google Reports iPhone Usage 50x Other Handsets; Amazon S3 Goes Down: Low Friction Has a Cost

Posted by smoothspan on February 15, 2008

As I write this post there are two articles that caught my eye.  For most, the iPhone and Amazon’s Web Services have little to do with one another, but I see a bit of a pattern here that’s interesting.

Slash Lane of Apple Insider reports that Google was shocked that it was seeing 50 times more search requests coming from Apple iPhones than any other mobile handset, a revelation so astonishing that the company originally suspected it had made an error culling its own data.  It’s an amazing statistic, really.  But I can attest to hitting Google quite a lot myself whenever I’m out and about and killing time before the next meeting.  In fact, I am very pleased to have my bookmarks out on a web page rather than in my browser so I can easily access all of my favorite sites from whatever device is at hand.  The iPhone is quite a credible web browser.  I can’t wait for the 3G version and higher speeds.

Following closely on my read of the iPhone piece is Nick Carr’s article about an Amazon S3 outage.  Nothing all that earth-shattering or unexpected, just that S3 was out for several hours this morning, beginning at 7:30am EST.  The gist of the article is that while the outage was to be expected, Amazon did a poor job keeping users informed of what was going on and providing explanations after the fact.  Carr is right, of course, but a business is always embarrassed when things go wrong and the first (and wrong) human instinct is to be shy about details.

Why do these two go together?  I’ll give you a hint:  the tales of Facebook applications reaching millions of users in an incredibly short time also go with the theme I’m thinking of.  That theme has to do with friction.  Friction is my word for all the factors that slow adoption.  The time needed for word of mouth, decision making, purchase, installation, getting through the learning curve, and finally being a first class citizen of whatever community results is governed by the degree of friction.

One of the things the Internet does is reduce friction.  In its most extreme, friction actually reverses and becomes a propelling force.  We call that viral marketing.  Most of the innovations in this second Internet round (post-bubble) have been focused on reducing friction.  Social Networks, for example, dramatically reduce the friction of networking.  Twitter dramatically reduces the friction of blogging, right down to limiting the article length to 140 characters so you don’t have to labor over the wordsmithing.

While it’s harder, the web is also a powerful means of reducing friction for more physical things.  The iPhone and Amazon Web Services are two great examples.  In an extremely short time the iPhone has racked up 50x the usage of other competing handsets for the Internet.  The traffic to AWS in approximately the same short time now exceeds the combined traffic for all other Amazon properties.

While the web itself helped to spread the word, I think it is no coincidence that these two have a lot to do with the web and offer a lot of value back to the web.  It’s what some folks call a virtuous circle.  Look for more of these as time goes on.

Now for the cost side.  These growth rates are not predictable.  Nobody would have guessed that either business would get so big so fast.  In fact, many guessed just the opposite.  Even if you did guess it could happen, it would only be a guess that it could, not that it would.  A prudent business would not invest in infrastructure built to the level and assumption that it would happen.  That means there will be painful outages from time to time.  Hopefully, the infrastructure owners will take those outages as signs that it’s time to double down and extend their projections of what might happen much further up and to the right.  Those that succeed in keeping hold of the Tiger by the Tail will survive and prosper.

Posted in Marketing, Web 2.0, amazon, data center, grid, multicore

Software Testing in the Multicore Cloud Computing Era With Replay Solutions

Posted by smoothspan on February 11, 2008

I had the opportunity to visit Jonathan Lindo, CEO and co-founder of Replay Solutions, last week and I came away impressed.  This Hummer Winblad and Partech backed startup has some fascinating new technology to help with software testing and debugging.  I like to think of their software as a time machine for complex software.  With it, you can go back and recreate the circumstances that led to a bug, and thereby figure out what has happened.  Their software works by turning your J2EE application into a black box, and monitoring everything that comes and goes into or out of the box.  Using their proprietary algorithms, the data required to do this is actually kept very small.  So small, that the company got its start helping game companies to monitor their software using the same technology.  They’re still doing a business in that market, and you can imagine the software has to be pretty unintrusive if it’s not going to interfere with a game.  And so it is.

It works this magic by tapping into and instrumenting the Java code.  This sounds a lot like what my old alma mater Pure Software did with their memory leak detection.  What’s nice about it is that no access to source code is required.  In the demo, Jonathan fired up an app server (they support Tomcat and JBoss, and soon WebLogic), lit up their instrumentation module, and from that point on just used the software being tested normally.  Of course in the demo, using the software “normally” eventually led to a crash.  It was the classic ugly Java stack dump that tells you very little about what actually happened–just the thing to annoy both the user and the developers.

Replay Solutions to the rescue.  Jonathan likes to think of it as “Tivo for Software.”  Looking at the screen one sees a screenshot of every HTML rendering to the screen.  This makes it easy to tell where in the recorded dump you are and what the user was doing at the time.  The developer can set breakpoints in their code and then use ReplayDIRECTOR (that’s what the software is called) to bring the program up to the point of failure.  This can be done over and over until the programmer has figured out what went wrong.

Sounds cool, but why is this software an essential tool for the Multicore Cloud Computing Era?  Think about it.  In the old days, reproducing bugs was hard enough.  It could take days to find the exact set of steps needed to make a bug reproducible.  And until the bug is reproducible, it’s nearly impossible to fix.  Now fast forward to the Multicore Cloud Computing Era.  You’ve got hundreds or even thousands of simultaneous users running against a hundred or more CPU’s.  There are many many processes running.  Developers recognize this as a nightmare situation, because it becomes impossible to reproduce bugs in such a world.  How would you ever get all of those users to do exactly the same thing twice?  Add to that all the other crazy timing-related issues and it’s darned near impossible to track down many kinds of bugs on such software.

I talked over with Jonathan what I thought was a really cool scenario.  Would it be possible to set up ReplayDIRECTOR to continuously monitor a big SaaS or Web 2.0 system?  The answer, surprisingly, is that it is completely possible.  Suddenly, we can make these kinds of bugs reproducible.  But it gets even better.  ReplayDIRECTOR will reproduce the problem on far less hardware than the original system.  That’s another big issue to be faced with such systems–the cost of providing a duplicate environment for testing.  With Replay, the “black box” can be just the J2EE server.  All of the other pieces are simulated.
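To make the “black box” idea concrete, here is a minimal sketch of record and replay in Python.  It is not how ReplayDIRECTOR works (their algorithms are proprietary and target J2EE); it only illustrates the general principle that if every nondeterministic input crossing the box boundary is logged, the same run can be reproduced later without the original environment.  The names `Recorder`, `Replayer`, and `handle_request` are hypothetical.

```python
import random
import time


class Recorder:
    """Runs boundary calls for real and logs every result that enters the box."""

    def __init__(self):
        self.log = []

    def call(self, name, fn, *args):
        result = fn(*args)
        self.log.append((name, result))
        return result


class Replayer:
    """Feeds logged results back in order instead of touching the outside world."""

    def __init__(self, log):
        self._log = list(log)
        self._pos = 0

    def call(self, name, fn, *args):
        recorded_name, result = self._log[self._pos]
        if recorded_name != name:
            raise RuntimeError(f"replay diverged: expected {recorded_name}, got {name}")
        self._pos += 1
        return result


def handle_request(boundary):
    """Application code: every nondeterministic input goes through the boundary."""
    roll = boundary.call("random", random.randint, 1, 6)
    stamp = boundary.call("clock", time.time)
    return f"roll={roll} at={stamp}"


recorder = Recorder()
live_result = handle_request(recorder)                     # production run, recorded
replayed_result = handle_request(Replayer(recorder.log))   # later, on a developer box
print(live_result == replayed_result)
```

During replay the clock and random number generator are never consulted, which is the same reason the replayed system can run on far less hardware: the outside pieces are simulated from the log.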

If I were currently involved with a J2EE-architecture piece of Enterprise Software, I would definitely be trying to get into Replay’s Beta Testing program.

Posted in Web 2.0, multicore, saas

Apple, MacWorld, User Experience, and the Multicore Crisis

Posted by smoothspan on January 16, 2008

Looking over the parachute drops of information from MacWorld, I was struck by some underlying themes.  I won’t bore you with a recitation of the huge amount of surface level activity: there are plenty of better, more firsthand places to get that.  But some of those firsthand sources excited some patterns I’m familiar with.

First, the multicore crisis bit.  I’ve written about it before, but let me recap.  What is the multicore crisis?  It is a wave of change that is being unleashed by virtue of the fact that microprocessors have stopped getting faster every 18 months.  Instead of gaining a faster clock speed with free benefits for all at scarcely any effort, we get more cores.  That ain’t bad, but it takes considerable effort at the software end to take advantage of the additional cores.  For the most part, we are far from keeping up with the availability of those cores.  For emphasis, here is a graph of Intel clock speeds that vividly shows just how long the curve has been flattened out:

Clock Speed Timeline

We’ve had another year in 2007 while the curve remained flat.

What does this have to do with Apple and MacWorld?  Well, in a simple vein, it was the multicore crisis checking in that caused Mathew Ingram to write, “Hey, Steve–you broke the Internet.”  He was remarking about how Twitter was virtually unusable for hours.  Twitter has become somewhat of an unwilling canary in the coal mine: if something is hot and getting traffic, Twitter seems bound to go down.  Why?  Because it is a victim of the Multicore Crisis.  The system’s architecture isn’t scaling.  It may be a software problem, i.e. it is not designed to take advantage of enough CPUs, or an infrastructure problem, i.e. it can only take advantage of the CPUs Twitter has physically bought and installed in their data center.  These can both be overcome.  Software can be made to take advantage of lots more processors.  Services like those Amazon and others offer let you scale up to many more CPUs on short notice without having to buy physical hardware.  Failure to provide for both these contingencies is succumbing to the Multicore Crisis.

Twitter was not unique.  Mathew’s blog was very slow to come up when I tried to access the article, having been Techmemed.  He mentions Fake Steve Jobs got creamed and couldn’t make CoverIt Live work (Zoli mentions CoveritLive was CoveritDead).  The Apple store was down at one point too.

Scoble tells a similar story:  Engadget was up but very slow, Qik’s macworld channel was up and down, Mogulus was slow to unreachable.  Live video was hard to come by.  TUAW was fairly unreachable.  There were a couple of sites that passed muster including TechCrunch (bravo!) and MacRumorsLive.  TechCrunch hammers Twitter for being down.  Again.  If, as its pundits like to think, Twitter will play a significant role in reporting events, it needs to work all the time.  It is, after all, a communication channel.  Moreover, it’s a communication channel under constant scrutiny.

This brings me to a point I want to make about the Multicore Crisis and The Big Switch (what Nick Carr calls the trend to move to Cloud Computing).  These two megatrends are combining to change what the important core competencies are to succeed.  Once upon a time, it was enough just to be able to lash together all the myriad pieces needed to create a web application with a good user design.  You could count on Moore’s Law to make machines faster and your customer growth was slow enough that scalability could be comfortably pushed out into the future as a high quality problem to deal with if you succeeded.  That’s no longer the case.  The ability for new ideas to catch on has become viral on the web for a variety of reasons, not the least of which is that so many more people are on the web and they’re interconnected in so many more ways than simple e-mail, search, and web browsing.

There is another, more subtle manifestation of all this.  The new MacBook Air personifies this.  In the Multicore era: user experience is the new black for hardware.  Why?  Well, in the old days, everyone wanted to upgrade every two years.  For a while, I bought a new PC every year.  And it was worth it.  The new machines were significantly faster than the old.  In a world where the upgrade cycle is so short, you want to buy cheap hardware.  Result?  Dell wins big.  They’re the best at building their hardware cheap, so you can buy it more often, so you can get that speed.  Dell was driven by the Need for Speed, and the relative ease with which Moore’s Law delivered it.

Times have changed.  In an era when you probably won’t upgrade every two years, let alone every year, it makes sense to look at something other than speed.  I have an idea, how about looking at the User Experience?  Is the machine sexier?  Does it do cool things?  I love the Air’s ability to “borrow” a disk drive via WiFi from a nearby machine as well as its ability to handle iPhone-like gestures on its touch pad.  Combining Apple’s trademark radically uber-cool Industrial Design with genuine usability innovation is a winning formula.  If it gets you to buy a new machine when you otherwise would be happy to stand pat, they win.  The fact that so much of what one does on a computer is via the Internet combined with the rise of very effective virtualization software has radically lowered the barriers to PC/Windows users buying a Mac as well.  The latter is the Big Switch component.

That’s two significant changes brought on by the Multicore Crisis and The Big Switch.  What is your company doing to get ahead of these trends before some competitor uses them to ride right over your business?

Posted in Web 2.0, data center, multicore, saas, strategy, user interface

Scalability is a Requirement for Startups

Posted by smoothspan on December 6, 2007

Dharmesh Shah wonders whether startups should ignore scalability:

You’re worrying about scalability too early. Don’t blow your limited resources on preparing for success. Instead, spend them on increasing the chances that you’ll actually succeed.

It’s an interesting question: should startups worry about scalability, or does that get in the way of finding a proper product/market fit?  If you’ve read my blog much you’ll know that I view achieving that product/market fit as the highest priority for a startup, and I’m not alone: Marc Andreessen says it too.  I think this is so important that I have advocated some relatively radical architectural ramifications to help facilitate the flexibility of a product so it can evolve towards that ideal even faster.

But where does scalability fit in?  Can you achieve that product/market fit without it?  For most startups, I think it is either difficult to verify a true product/market fit without it, or worse, you may achieve it only to immediately fall to earth a victim of poor user experience.  There are certainly plenty of examples of companies that started out great, seemed to have that product/market fit, but got into persistent hot water because they couldn’t scale out a good user experience when their site began to take off.  Fred Wilson wrote recently about his love/hate relationship with Technorati, which has been a good example of this.

Here is another question, “How much success do you need to verify product/market fit?”  Signing up a few customers to a beta, or even having a large beta, is not really enough in my opinion.  It’s pretty easy to get a ton of people to try something that sounds sexy and is promoted well.  The question is whether that really takes well enough.  Marc Andreessen’s Ning is a good example.  When they launched their original product it required a fair amount of custom programming to create a custom Social Network.  They had 30,000 social networks created even so, but the service wasn’t taking off.  Michael Arrington was calling it R.I.P.  Then they released a version that eliminated the need for programming and suddenly the product/market fit was there and it took off like a rocket, crossing 100,000 social networks in record time.  Clearly Ning had to deal with scalability before they could learn much about their product/market fit.

Google is another great example of this.  They had to scale from day one because of the problem they were solving.  Om Malik says their infrastructure and ability to scale is actually their strategic advantage.  Certainly the nature of the problem Google wanted to solve required scalability from day one.  This is how Aloof Schipperke wants to view the question when he says, “Scalability is a requirement, not an optimization.”  It’s a bit of a double entendre.  One could say it is a requirement that all startups deal with it, or one could say startups need to evaluate whether scalability is a requirement in their domain.  I’m in that latter camp.  Figure out what success really looks like.  When do you know you have product/market fit?  Be conservative.  What are the requirements to get there?  Aloof lumps scalability in with other “ilities”.  Can your startup reach product/market fit without security, for example?  The answers may surprise you if you’re really honest about it.

Chances are, you may have to do more to be sure about product/market fit than you are comfortable with in release 1.0.  You’ll need a phased plan for how to get there.  Lest you use this as an excuse to ignore scalability until the last minute, keep in mind that these phased plans should have short milestones.  Quarterly or six month iterations at most.  Scaling a really poorly architected application can amount to a painful rewrite.  So do a phasing plan for scaling.  What are the big ticket items you’ll need to enable early so that scaling later is not too hard?  There are a few well-known touchpoints that can make scalability easier.  I’m not going to go over all of them, you know what I’m talking about:  statelessness, RESTful web services, and beware the database.  If you don’t know about these things, get some people on your team who do!  It’s not hard to start out with a plan in mind about your eventual scalability and just make sure that along the way you don’t inadvertently shoot yourself in the foot.  It usually boils down to securing the two ends of the puzzle with good scalability:

- How will the client side web servers scale?

- How will the database back end scale?

Make a plan for what it will look like when it’s done, and put phased milestones in place to get there over time. 

Here’s another key issue.  Dharmesh’s original question assumes scalability and user experience compete for scarce resources.  Ed Sim somewhat follows this path too when he writes that it’s hard to sell Scalability.  Aren’t we talking about the tradeoffs between UI/Features and Infrastructure (web or DB)?  Are the same engineers really doing both things?  It seems to me a lot more common to have a “front end” or application group and a “back end” or infrastructure group, even if “group” is a bit grandiose for a couple of people.  Take the opportunity to map out how the modules produced by these two groups will communicate.  Make that communication architecturally clean so the groups are decoupled.  Make the communication work the way it will when you build out scalability, but then don’t build it out at first.  This will enable the infrastructure group’s agenda to decouple from the user experience guys.
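As a rough illustration of “make the communication work the way it will when you build out scalability, but don’t build it out at first,” here is a small Python sketch.  Everything in it is hypothetical (`ProfileStore`, `SingleNodeStore`, `ShardedStore`, `render_greeting`); the point is only that the front end codes against an interface, so the back end team can swap a single-node store for a partitioned one later without the front end changing.

```python
import zlib
from typing import Optional, Protocol, Sequence


class ProfileStore(Protocol):
    """The contract between the front end and back end groups (hypothetical)."""

    def get(self, user_id: str) -> Optional[dict]: ...
    def put(self, user_id: str, profile: dict) -> None: ...


class SingleNodeStore:
    """Release 1.0: everything lives in one place."""

    def __init__(self) -> None:
        self._rows: dict = {}

    def get(self, user_id: str) -> Optional[dict]:
        return self._rows.get(user_id)

    def put(self, user_id: str, profile: dict) -> None:
        self._rows[user_id] = profile


class ShardedStore:
    """Later: same interface, but rows are partitioned across several stores."""

    def __init__(self, shards: Sequence[ProfileStore]) -> None:
        self._shards = list(shards)

    def _shard_for(self, user_id: str) -> ProfileStore:
        # crc32 is stable across processes, unlike Python's built-in hash().
        return self._shards[zlib.crc32(user_id.encode("utf-8")) % len(self._shards)]

    def get(self, user_id: str) -> Optional[dict]:
        return self._shard_for(user_id).get(user_id)

    def put(self, user_id: str, profile: dict) -> None:
        self._shard_for(user_id).put(user_id, profile)


def render_greeting(store: ProfileStore, user_id: str) -> str:
    """Front end code: knows the interface, not how the data is laid out."""
    profile = store.get(user_id)
    return f"Hello, {profile['name']}!" if profile else "Hello, stranger!"


for store in (SingleNodeStore(), ShardedStore([SingleNodeStore(), SingleNodeStore()])):
    store.put("u1", {"name": "Ada"})
    print(render_greeting(store, "u1"), render_greeting(store, "u2"))
```

The front end function is identical in both runs, which is the decoupling the two-team arrangement depends on.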

BTW, if you’re thinking the true competition between the two is you want to hire all user experience people with your capital and no infrastructure, that just sounds like a bad idea to me.  It’s hard to deliver good user experience if your infrastructure is lousy, buggy, and doesn’t perform.  There are ample studies that show the speed with which your application serves up pages is a big contributor to user experience as well. 

I’ve gone down this path before of having essentially two small teams and making sure there was clean communication between their code from the start.  My company PriceRadar was lucky enough to land a partnership with AskJeeves early on.  Part of the deal was we had to pass a load test that showed we could handle 10,000 simultaneous users hammering our application.  At the time, most of my experience and developers were from the Microsoft world, so we were .NET all the way.  I remember meeting with the advisory board for a company called iSharp.  It was an all-star cast of web application CTOs and VPs of Engineering.  We went around the table to hear what everyone was doing.  I was the only Microsoft guy in the room, and the Unix crowd just laughed when I told them we had to pass this big load test.  AskJeeves’ CTO was there as well as the fellow in charge of AOL Instant Messenger and about 10 others.  They flat said it was impossible on Unix.  In less than a month we had it all working with a distributed grid architecture.  The front end guys were never even involved and changed little or no code.  The back end guys didn’t sleep much, but they emerged triumphant.  And the entire team was about 10 developers, per my small team mentality.

Yes Virginia, you should worry about your scalability, but it need not be all consuming.  You can handle it.

Posted in Web 2.0, data center, grid, multicore, platforms, saas, strategy

A Pile of Lamps Needs a Brain

Posted by smoothspan on October 28, 2007

Continuing the discussion of a Pile of Lamps (a clustered Lamp stack in more prosaic terms), Aloof Schipperke writes about how such a thing might manage its consumption of machines on a utility computing fabric:

Techniques for managing large sets of machines tend to be either highly centralized or highly decentralized. Centralized solutions tend to come from system administration circles as ways to cope with large quantities of machines. Decentralized solutions tend to come from the parallel computing space where algorithms are designed to take advantage of large quantities of machines.

Neither approach tends to provide much coupling between management actions and application conditions. Neither approach seems well adapted for any form of semi-intelligent dynamic configuration of multi-layer web application. Neither of them seem well suited for non-trivial quantities of loosely coupled LAMP stacks.

Aloof has been contemplating whether a better approach might be to have the machines converse amongst themselves in some way.  He envisions machines getting together when loads become too challenging and deciding to spawn another machine to take some of the load on.

Let’s drop back and consider this more generally.  First, we have a unique capability emerging in hosted utility grids.  These range from systems like Amazon’s Web Services to 3Tera’s ability to create grids at their hosting partners.  It started with the grid computing movement which sought to use “spare” computers on demand, and has now become a full blown commercially available service.  Applications can order and provision a new server literally on 10 minutes notice, use it for a period of time, and then release the machine back to the pool only paying for the time they’ve used.  This differs markedly from stories such as iLike’s, who had to drive around in a truck borrowing servers everywhere they could, and then physically connect them up.  Imagine how much easier it could have been to push a button and bring on the extra servers on 10 minutes notice as they were needed.

Second, we have the problem of how to manage such a system.  This is Aloof’s problem.  Just because we can provision a new machine on 10 minutes notice doesn’t mean a lot of other things:

  • It doesn’t mean our application is architected to take advantage of another machine. 
  • It doesn’t mean we can reconfigure our application to take advantage in 10 minutes.
  • It doesn’t mean we have a system in place that knows when it’s time to add a machine, or take one back off.
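The third bullet, knowing when to add a machine or take one back off, can be sketched as a small decision function.  This is only an illustration under stated assumptions: the thresholds, the `ScalingPolicy` and `desired_machines` names, and the idea of using a single averaged load figure between 0 and 1 are all hypothetical, and the actual provisioning call to a utility grid is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical thresholds; real values depend on the application."""
    scale_up_at: float = 0.75    # average load above this: add a machine
    scale_down_at: float = 0.30  # average load below this: release a machine
    min_machines: int = 1
    max_machines: int = 20


def desired_machines(policy: ScalingPolicy, current: int, avg_load: float) -> int:
    """Return how many machines we want, given the current count and average load.

    The gap between the two thresholds is a simple form of hysteresis: it keeps
    the system from adding and releasing machines on every small fluctuation,
    which matters when each change takes minutes and costs money.
    """
    if avg_load > policy.scale_up_at:
        return min(current + 1, policy.max_machines)
    if avg_load < policy.scale_down_at:
        return max(current - 1, policy.min_machines)
    return current


policy = ScalingPolicy()
for load in (0.90, 0.50, 0.10):
    print(load, "->", desired_machines(policy, current=4, avg_load=load))
```

A function like this says nothing about the first two bullets.  The application still has to be able to use the extra machine, and be reconfigured quickly enough for the decision to matter.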

This requires another generation of thinking beyond what’s typically been implemented.  New variable cost infrastructure has to trickle down into fixed cost architectures.  For me, this sort of problem always boils down to finding the right granularity of “object” to think about.  Is the machine the object?  Whether or not it is, our software layers must take account of machines as objects because that’s how we pay for them.

So to attack this problem, we need to understand a collection of questions:

  1. What is to be our unit of scalability?  A machine?  A process?  A thread?  A component of some kind?  At some level, the unit has to map to a machine so we can properly allocate on a utility grid.
  2. How do we allocate activity to our scalability units?  Examples include load balancing and database partitioning.  Abstractly, we need some hashing function that selects the scalability unit to allocate work (data, compute crunching, web page serving, etc.) to.
  3. What is the mechanism to rebalance?  When a scalability unit reaches saturation by some measure, we must rebalance the system.  We change the hashing function in #2, and we need a mechanism to redistribute without losing anything while the process is happening.  We also must understand how we measure saturation or load for our particular domain.
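
Questions 2 and 3 can be sketched together.  The Python below is a minimal illustration, assuming the simplest possible hashing function (a hash of the key modulo the number of units) and made-up user names as keys.  The helper names `unit_for` and `keys_to_move` are hypothetical.

```python
import hashlib

def unit_for(key, num_units):
    """Map a key (user id, table name, URL) to a scalability unit index."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_units

def keys_to_move(keys, old_units, new_units):
    """List the keys whose home changes when the unit count changes."""
    return [k for k in keys if unit_for(k, old_units) != unit_for(k, new_units)]

users = [f"user{i}" for i in range(1000)]
moved = keys_to_move(users, old_units=4, new_units=5)
print(f"{len(moved)} of {len(users)} users must be migrated")
```

With a plain modulo scheme, changing the unit count relocates a large share of the keys, which is exactly the redistribution cost question 3 asks about.  Schemes such as consistent hashing exist to shrink that share, at the price of a more involved hashing function.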

Let’s cast this back to the world of a Pile of Lamps.  A traditional Lamp stack scaling effort is going to view each component of the stack separately.  The web piece is separate from the data piece, so we have different answers for the 3 issues on each of the 2 tiers.  Pile of Lamps changes how we factor the problem.  If I understand the concept correctly, instead of independently scaling the two tiers, we will simply add more Lamp clusters, each of which is a quasi-independent system.

This means we have to add a #4 to the first 3.  It was implicit anyway:

    4.  How do the scaling units communicate when the resources needed to finish some work are not all present within the scaling unit?

Let’s say we’re using a Pile of Lamps to create a service like Twitter.  As long as the folks I’m following are on the same scaling unit as me, life is good.  But eventually, I will follow someone on another scaling unit.  If the Pile of Lamps is clever, it makes this transparent in some way.  If it can do that, the other three issues are at least things we can go about doing behind the scenes without bothering developers to handle it in their code.  If not, we’ll have to build a layer into our application code that makes it transparent for most of the rest of the code.
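
Here is a rough Python sketch of what "transparent" could look like for the Twitter-like case.  Everything in it is an assumption for illustration: `ScalingUnit` stands in for a whole LAMP stack, `Pile` is a hypothetical routing layer, and in-memory dictionaries replace the real databases and network calls.

```python
import zlib

class ScalingUnit:
    """Stands in for one LAMP stack; holds posts for the users it owns."""
    def __init__(self, name):
        self.name = name
        self.posts = {}  # user -> list of posts

    def add_post(self, user, text):
        self.posts.setdefault(user, []).append(text)

    def posts_by(self, user):
        return list(self.posts.get(user, []))

class Pile:
    """Routing layer that hides which unit owns which user."""
    def __init__(self, units):
        self.units = units

    def home_of(self, user):
        # crc32 is stable across runs, unlike Python's built-in hash().
        return self.units[zlib.crc32(user.encode("utf-8")) % len(self.units)]

    def post(self, user, text):
        self.home_of(user).add_post(user, text)

    def timeline(self, following):
        # The caller never learns whether a followed user is local or remote.
        return {user: self.home_of(user).posts_by(user) for user in following}

pile = Pile([ScalingUnit("lamp-a"), ScalingUnit("lamp-b")])
pile.post("alice", "hello")
pile.post("bob", "scaling is hard")
print(pile.timeline(["alice", "bob"]))
# {'alice': ['hello'], 'bob': ['scaling is hard']}
```

Application code only ever calls `post` and `timeline`, so the answers to the other three questions can change behind `home_of` without touching it.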

I think Aloof’s musings about whether #3 can be done as conversations between the machines will be clearer if the Pile of Lamps idea is mapped out more fully in terms of all 4 questions.

Posted in Web 2.0, grid, multicore, platforms, strategy | 1 Comment »

Pile O’ LAMPs: What Would Fielding Say?

Posted by smoothspan on October 21, 2007

I’ve been pondering the Pile O’ Lamps concept that I first read about in Aloof Architecture and Process Perfection.  Read the posts yourself for the horse’s mouth, but to me, the Pile O’ Lamps concept is basically asking whether a computing grid of LAMP stacks is a worthwhile architectural construct that could be highly reusable for a variety of applications.  I say grid, because in my mind, it achieves maximal potential if deployed flexibly on a utility computing fabric such as Amazon EC2 where it can automatically flex to a larger cluster based on load requirements.  If it is fixed in size by configuration (which still means changeable, just not as quickly and automatically), I guess it would be more proper to call it a LAMP cluster.

LAMP refers to Linux as the OS, Apache as the web server, MySQL as the database, and a “P” language (usually PHP or Python) as the language used to implement the application.  It has become almost ubiquitous as a superfast way to bring up a new web application.  There are some shortcomings, but by and large, it remains one of the simplest ways to get the job done and still have the thing continue to work if you move into the big time.  A Pile of Lamps architecture would presumably simplify scaling by building it in at the outset rather than trying to tack it on later.

In general, I love the idea.  People are effectively doing what it calls for all the time anyway; they just do so in an ad hoc manner.  I got ambitious this Sunday morning and thought I’d drag out Fielding’s Dissertation and see how the idea stacks up.  If you’ve never had a look at Roy Fielding’s Architectural Styles and the Design of Network-Based Software Architectures, you missed out on a beautiful piece of work from one of the principal authors of the HTTP specification.  This particular document sets forth the REST (Representational State Transfer) architecture.  What’s cool about it is that Fielding has a framework that he uses to evaluate the various components of REST that is applicable to a lot of other network architecture problems.  See Chapter 3 of the Dissertation for details, but that is my favorite part of the document. 

His concept is to create a scorecard for various network architectural components, and then use that scorecard together with the domain requirements of the design problem to arrive at an optimal architecture.  He says that’s how he got to REST, and it certainly seems to make sense as you read the Dissertation.  Here is a rendition of his ranking criteria for the models he considers:

Fielding Framework

A “0” means the architectural style is beneficial to some domains and not others.  Positive means the style has benefit and negative means it is a poorer choice.

The components that make up REST look like this:

RESTful according to Fielding

There are 3 components that go into it:

  • Layered Cached Stateless Client Server:  The row marked LCS+C$SS
  • Uniform Interface, which isn’t in the original Fielding taxonomy, but which he says adds the qualities listed.
  • Code on Demand:  This is the ability of the web to send code to the client for execution based on what it requests.  So, for example, Flash or AJAX.

The “RESTful Result” is simply the total of the other attributes.  You can see it hits pretty darned well on most of the categories with the exception of Network efficiency.  As noted, this primarily means it isn’t suited to extremely fine grained communication, but is fine for a web page.  Pretty cool framework, eh?
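
The totaling step is simple enough to sketch in Python.  The property names below are borrowed from Fielding's list, but the two styles and every score are placeholders invented purely to show the arithmetic.  They are not Fielding's values and not the numbers from the scorecards above.

```python
from collections import Counter

def combine(*scorecards):
    """Add up per-property scores across several architectural styles."""
    total = Counter()
    for card in scorecards:
        total.update(card)  # Counter.update adds values rather than replacing them
    return dict(total)

# Placeholder scores for two made-up styles (NOT Fielding's numbers).
style_a = {"scalability": 1, "simplicity": 1, "efficiency": -1}
style_b = {"scalability": 1, "efficiency": 1, "visibility": 1}

print(combine(style_a, style_b))
# {'scalability': 2, 'simplicity': 1, 'efficiency': 0, 'visibility': 1}
```

A property that one style helps and another hurts nets out to zero, which is how a combined style can end up neutral on something like efficiency.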

Incidentally, Fielding’s framework really dumps on CORBA for all the right reasons.  Give it a read to see why.

Now let’s look at the Pile of Lamps.  Note that we aren’t trying to compare it to REST–they solve different problems.  Fielding tells us to do the analysis based on our domain, so put aside the RESTful scores, they aren’t meaningful to compare to anything but REST competitors.  Here is the result for Pile of Lamps:

Pile of Lamps

I view the LAMP stack as Layered Client Server, which is already a decent style.  A Pile of Lamps, it seems to me, basically adds a cached and replicated capability to the LAMP stack, so I add the cached/replicated repository to the equation.  You can see that it amplifies the LAMP stack while taking nothing away.  Basically, it makes it more efficient, more scalable, and it delivers those benefits in a simple way.  This makes total sense to me, given the concept. 

One can use the framework to fiddle with other potential additions to the Pile of Lamps idea.  For example, what if statelessness were pervasive in this paradigm?  I leave further refinement of the idea to readers, commenters, and the original authors, but it looks promising to me.  I’d also encourage others to delve into Fielding’s work.  It has application well beyond just describing REST.

Related Articles

A Pile of Lamps Needs a Brain

Posted in Open Source, Web 2.0, grid, multicore, platforms, software development, strategy | 3 Comments »

Amazon Beefs Up EC2 With New Options

Posted by smoothspan on October 16, 2007

I’ve been a big fan of Amazon’s Web Services for quite a while and attended their Startup Project, which is an afternoon seeing what it can do and hearing from entrepreneurs who’ve built on this utility computing fabric.  Read my writeup on the Startup Project for more.  Amazon has been steadily rolling out improvements, such as the addition of SLAs for the S3 storage service.  Today, there is big news in the Amazon EC2 camp:

Amazon has just announced two new instance types for their EC2 utility computing service.  The original type will continue to be available as the “small” type.  The “large” type has four times the CPU, RAM, and Disk Storage, while the “extra large” has eight times the CPU, RAM, and Disk.  The large and extra large also sport 64 bit cpus.  Supersize your EC2!

Why do this?  Because the original small instance was a tad lightweight for database activity with just 1.7GB of RAM while the extra large at 15GB is about right.  Imagine a cluster of the extra large instances running memcached and you can see how this is going to dramatically improve the possibilities for hosting large sites.

One of the neat things about this new announcement is pricing.  They’ve basically linearly scaled pricing.  Whereas a small instance costs 10 cents per instance hour, the extra large has 8x the capacity and costs 8×10 cents or 80 cents per hour.
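
The linear pricing is easy to check with a few lines of Python.  The 10 cent base rate and the 1x, 4x, and 8x capacity multiples come from the announcement as described above; the 40 cent figure for the large type is what linear scaling implies, and the function name is just for illustration.

```python
BASE_RATE = 0.10  # dollars per instance hour for the small type
CAPACITY = {"small": 1, "large": 4, "extra large": 8}  # multiples of a small

def hourly_cost(instance_type, count=1):
    """Cost per hour for `count` instances, assuming strictly linear pricing."""
    return round(BASE_RATE * CAPACITY[instance_type] * count, 2)

for name in CAPACITY:
    print(f"{name}: ${hourly_cost(name):.2f}/hour")
# small: $0.10/hour
# large: $0.40/hour
# extra large: $0.80/hour
```

Under this pricing, eight small instances cost the same per hour as one extra large, so the choice between them comes down to architecture rather than price.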

What’s next?  These new instances open a lot of possibilities, but Amazon still doesn’t have painless persistence for databases like MySQL.  If you are running MySQL on an extra large instance and the server goes down for whatever reason, all the data on it is lost and you have to rebuild a new machine around some form of hot backup or failover.  That exercise has been left to the user.  It’s doable: you have to solve the problem in any data center of what you plan to do if the disk totally crashes and no data can be recovered.  However, folks have been vocally requesting a better solution from Amazon where the data doesn’t go away and the machine can be rebooted intact.  I was told by the EC2 folks at the Startup Project to expect 3 announcements before the end of the year that were related.  I’m guessing this is the first such announcement and two more will follow. 

There’s tremendous excitement right now around these kinds of offerings.  They virtualize the data center to reduce the cost and complexity of setting up the infrastructure to do web software.  They allow you to flex capacity up or down and pay as you go.  Amazon is not the only such option.  I’ll be reporting on some others shortly.  It’s hard to see how it makes sense to build your own data center without the aid of one of these services any more. 

Posted in Web 2.0, amazon, ec2, grid, multicore, platforms, saas, software development | 2 Comments »

To Escape the Multicore Crisis, Go Out Not Up

Posted by smoothspan on September 29, 2007

Of course, you should never go up in a burning building; go out instead.  Amazon’s Werner Vogels sees the Multicore Crisis in much the same way:

Only focusing on 50X just gives you faster Elephants, not the revolutionary new breeds of animals that can serve us better.

Vogels is writing there about Michael Stonebraker’s claims that he can demonstrate a database architecture that outperforms conventional databases by a factor of 50X.  Stonebraker is no one to take lightly: he’s accomplished a lot of innovation in his career so far and he isn’t nearly done.  He advocates replacing the Oracle (and MySQL) style databases (which he calls legacy databases) with a collection of special purpose databases that are optimized for particular tasks such as OLTP or data warehousing.  It’s not unlike the concept that I and others have talked about, which suggests that the one-language-fits-all paradigm is all wrong and you’d do better to adopt polyglot programming.

I like Stonebraker’s work.  While I want the ability to scale out to any level that Vogels suggests, I will take the 50X improvement as a basic building block and then scale that out if I can.  That’s a significant scaling factor even looked at in the terms of the Multicore Language Timetable.  It’s roughly 8 years of Moore’s Cycles.  I’m also mindful that databases are the doorway to the I/O side of the equation which is often a lot harder to scale out.  Backing an engine that’s 50X faster sucking the bits off the disk with memcached ought to lead to some pretty amazing performance.
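
The conversion from a 50X speedup to years of Moore’s Cycles depends on the doubling period you assume.  Here is the arithmetic as a short Python sketch; the 18 and 24 month periods are the commonly quoted ones and are my assumption, as is the function name.

```python
import math

def years_of_moore(speedup, months_per_doubling=18):
    """Years of doublings needed to reach `speedup` at the given cycle length."""
    doublings = math.log2(speedup)
    return doublings * months_per_doubling / 12

print(f"{math.log2(50):.2f} doublings")                               # 5.64 doublings
print(f"{years_of_moore(50):.1f} years at 18 months per doubling")     # 8.5 years
print(f"{years_of_moore(50, 24):.1f} years at 24 months per doubling") # 11.3 years
```

At 18 months per doubling that works out to about eight and a half years, in the same ballpark as the figure above.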

But Vogels is right, in the long term we need to see different beasts than the elephants.  It was with that thought in mind that I’ve been reading with interest articles about Sequoia, an open source database clustering technology that makes a collection of database servers look like one more powerful server.  It can be used to increase performance and reliability.  It’s worth noting that Sequoia can be installed for any Java app using JDBC without modifying the app.  Their clever moniker for their technology is RAIDb:  Redundant Array of Inexpensive Databases.  There are different levels of RAIDb just as there are RAID levels that allow for partitioning, mirroring, and replication.  The choice of level or combinations of levels governs whether your application gets more performance, more reliability, or both.
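
To make the partitioning and mirroring ideas concrete, here is a toy sketch in Python.  It is not Sequoia’s API (Sequoia sits behind JDBC for Java apps).  The class names are made up, and in-memory dictionaries stand in for the database backends.

```python
import itertools

class MirroredCluster:
    """Toy mirroring: every write goes to all backends, reads rotate."""
    def __init__(self, backend_count):
        self.backends = [dict() for _ in range(backend_count)]
        self._next = itertools.cycle(range(backend_count))

    def write(self, key, value):
        for backend in self.backends:
            backend[key] = value

    def read(self, key):
        # Round-robin reads spread the query load across the copies.
        return self.backends[next(self._next)].get(key)

class PartitionedCluster:
    """Toy partitioning: each table lives on exactly one backend."""
    def __init__(self, table_to_backend, backend_count):
        self.backends = [dict() for _ in range(backend_count)]
        self.table_to_backend = table_to_backend

    def write(self, table, key, value):
        self.backends[self.table_to_backend[table]][(table, key)] = value

    def read(self, table, key):
        return self.backends[self.table_to_backend[table]].get((table, key))

mirror = MirroredCluster(backend_count=3)
mirror.write("user:1", "alice")
print([mirror.read("user:1") for _ in range(3)])  # each read hits a different backend

parts = PartitionedCluster({"users": 0, "orders": 1}, backend_count=2)
parts.write("orders", 42, "widget")
print(parts.read("orders", 42))
```

Mirroring buys reliability and read throughput but makes every write as expensive as the number of copies, while partitioning spreads writes but leaves each table on a single backend.  That tradeoff is why the choice of level matters.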

Sequoia is not a panacea, but for some types of benchmarks such as TPC-W, it shows a nearly linear speedup as more cpus are added.  It seems likely a combination of approaches such as Stonebraker’s specialized databases for particular niches and clustering approaches like Sequoia all running on a utility computing fabric such as Amazon’s EC2 will finally break the multicore logjam for databases.

Posted in Open Source, amazon, ec2, grid, multicore, platforms, software development | 4 Comments »