A Pile of Lamps Needs a Brain
Posted by Bob Warfield on October 28, 2007
Continuing the discussion of a Pile of Lamps (a clustered Lamp stack in more prosaic terms), Aloof Schipperke writes about how such a thing might manage its consumption of machines on a utility computing fabric:
Techniques for managing large sets of machines tend to be either highly centralized or highly decentralized. Centralized solutions tend to come from system administration circles as ways to cope with large quantities of machines. Decentralized solutions tend to come from the parallel computing space, where algorithms are designed to take advantage of large quantities of machines.
Neither approach tends to provide much coupling between management actions and application conditions. Neither approach seems well adapted for any form of semi-intelligent dynamic configuration of a multi-layer web application. Neither of them seems well suited for non-trivial quantities of loosely coupled LAMP stacks.
Aloof has been contemplating whether a better approach might be to have the machines converse amongst themselves in some way. He envisions machines getting together when loads become too challenging and deciding to spawn another machine to take on some of the load.
Let’s drop back and consider this more generally. First, we have a unique capability emerging in hosted utility grids. These range from systems like Amazon’s Web Services to 3Tera’s ability to create grids at their hosting partners. It started with the grid computing movement, which sought to use “spare” computers on demand, and has now become a full-blown commercially available service. Applications can order and provision a new server on literally 10 minutes’ notice, use it for a period of time, and then release the machine back to the pool, paying only for the time they’ve used. This differs markedly from stories such as iLike’s, whose team had to drive around in a truck borrowing servers wherever they could and then physically connect them up. Imagine how much easier it would have been to push a button and bring on the extra servers on 10 minutes’ notice as they were needed.
Second, we have the problem of how to manage such a system. This is Aloof’s problem. Just because we can provision a new machine on 10 minutes’ notice doesn’t mean a lot of other things:
- It doesn’t mean our application is architected to take advantage of another machine.
- It doesn’t mean we can reconfigure our application to take advantage in 10 minutes.
- It doesn’t mean we have a system in place that knows when it’s time to add a machine, or take one back off.
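That last point, knowing when it’s time to add a machine or take one back off, can start as something as simple as a threshold rule on measured load. A minimal sketch in Python (the function name and thresholds are mine, purely illustrative, not anything from Aloof’s discussion):

```python
# Illustrative threshold-based scaling decision; all names and
# thresholds here are hypothetical, not from any real system.
def scaling_decision(avg_load, machine_count, high=0.75, low=0.25, min_machines=1):
    """Return +1 to provision a machine, -1 to release one, 0 to hold steady.

    avg_load is mean utilization (0.0 to 1.0) across current machines.
    """
    if avg_load > high:
        return +1                # saturated: order another machine from the grid
    if avg_load < low and machine_count > min_machines:
        return -1                # underused: release one back to the pool
    return 0                     # within the comfort band: do nothing

print(scaling_decision(0.9, 4))  # -> 1
print(scaling_decision(0.1, 4))  # -> -1
print(scaling_decision(0.5, 4))  # -> 0
```

Real systems would smooth the load signal over time to avoid thrashing, but even this toy version shows the shape of the decision the infrastructure has to make on our behalf.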
This requires another generation of thinking beyond what’s typically been implemented. New variable-cost infrastructure has to trickle down into fixed-cost architectures. For me, this sort of problem always boils down to finding the right granularity of “object” to think about. Is the machine the object? Whether or not it is, our software layers must take account of machines as objects, because that’s how we pay for them.
So to attack this problem, we need to understand a collection of questions:
1. What is to be our unit of scalability? A machine? A process? A thread? A component of some kind? At some level, the unit has to map to a machine so we can properly allocate on a utility grid.
2. How do we allocate activity to our scalability units? Examples include load balancing and database partitioning. Abstractly, we need some hashing function that selects which scalability unit a given piece of work (data, compute crunching, web page serving, etc.) is allocated to.
3. What is the mechanism to rebalance? When a scalability unit reaches saturation by some measure, we must rebalance the system: change the hashing function from #2, with a mechanism to redistribute work without losing anything while the process is happening. We also must understand how we measure saturation or load for our particular domain.
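One well-known way to get both the hashing function of #2 and cheap rebalancing for #3 is consistent hashing, where adding a scaling unit moves only the work that lands on the new unit and leaves everything else alone. A rough Python sketch (the `Ring` class and the unit names are my own invention, just to show the idea):

```python
import hashlib
from bisect import bisect

def h(key: str) -> int:
    # Stable hash so the mapping is reproducible across runs
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring: the 'hashing function' of question 2.

    Each unit owns the arc of hash space up to its point on the ring.
    Adding a unit (question 3) claims one arc from one neighbor; every
    other key keeps its old home, so rebalancing stays cheap.
    """
    def __init__(self, units):
        self.points = sorted((h(u), u) for u in units)

    def add(self, unit):
        self.points = sorted(self.points + [(h(unit), unit)])

    def lookup(self, work_item: str) -> str:
        # Walk clockwise to the first unit at or past the item's hash
        keys = [p for p, _ in self.points]
        i = bisect(keys, h(work_item)) % len(self.points)
        return self.points[i][1]

ring = Ring(["lamp-1", "lamp-2", "lamp-3"])
before = {k: ring.lookup(k) for k in ("user:%d" % i for i in range(1000))}
ring.add("lamp-4")
after = {k: ring.lookup(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
print(moved)  # only a fraction of the keys move, and all of them move to lamp-4
```

A naive `hash(key) % machine_count` would also answer #2, but adding a machine then changes nearly every key’s home, which is exactly the redistribution pain #3 warns about.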
Let’s cast this back to the world of a Pile of Lamps. A traditional Lamp stack scaling effort is going to view each component of the stack separately. The web piece is separate from the data piece, so we have different answers for the 3 issues on each of the 2 tiers. Pile of Lamps changes how we factor the problem. If I understand the concept correctly, instead of independently scaling the two tiers, we will simply add more Lamp clusters, each of which is a quasi-independent system.
This means we have to add a #4 to the first 3. It was implicit anyway:
4. How do the scaling units communicate when the resources needed to finish some work are not all present within the scaling unit?
Let’s say we’re using a Pile of Lamps to create a service like Twitter. As long as the folks I’m following are on the same scaling unit as I am, life is good. But eventually, I will follow someone on another scaling unit. If the Pile of Lamps is clever, it makes this transparent in some way. If it can do that, the other three issues are at least things we can handle behind the scenes without asking developers to deal with them in their code. If not, we’ll have to build a layer into our application code that makes it transparent for most of the rest of the code.
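To make that concrete, such a transparency layer could keep a directory mapping each user to the scaling unit that owns them, and resolve follows through it so callers never care which unit a followee lives on. A toy Python sketch (all class and method names here are hypothetical, just to show the routing idea):

```python
# Hypothetical routing layer for question 4; nothing here is a real
# Pile of Lamps API, just an illustration of transparent cross-unit lookup.
class Pile:
    def __init__(self):
        self.units = {}        # unit name -> {user -> list of followed users}
        self.directory = {}    # user -> name of the unit that owns them

    def add_user(self, user, unit):
        self.units.setdefault(unit, {})[user] = []
        self.directory[user] = unit

    def follow(self, follower, followee):
        # Looks local to the caller even when followee is on another unit
        unit = self.directory[follower]
        self.units[unit][follower].append(followee)

    def timeline_sources(self, user):
        # Resolve each followed user to the unit that must serve their updates
        unit = self.directory[user]
        return {f: self.directory[f] for f in self.units[unit][user]}

pile = Pile()
pile.add_user("alice", "lamp-1")
pile.add_user("bob", "lamp-2")
pile.follow("alice", "bob")
print(pile.timeline_sources("alice"))  # {'bob': 'lamp-2'}
```

The directory is itself shared state that has to scale, which is why pushing this layer below the application code, rather than into it, is the harder and more interesting design problem.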
I think Aloof’s musings about whether #3 can be done as conversations between the machines will be clearer if the Pile of Lamps idea is mapped out more fully in terms of all four questions.