Software Testing in the Multicore Cloud Computing Era With Replay Solutions
Posted by Bob Warfield on February 11, 2008
I had the opportunity to visit Jonathan Lindo, CEO and co-founder of Replay Solutions, last week, and I came away impressed. This Hummer Winblad and Partech backed startup has some fascinating new technology to help with software testing and debugging. I like to think of their software as a time machine for complex software. With it, you can go back and recreate the circumstances that led to a bug, and thereby figure out what happened. Their software works by turning your J2EE application into a black box and monitoring everything that goes into or comes out of the box. Using their proprietary algorithms, the data required to do this is actually kept very small. So small that the company got its start helping game companies monitor their software using the same technology. They’re still doing business in that market, and you can imagine the software has to be pretty unintrusive if it’s not going to interfere with a game. And so it is.
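To make the black-box idea concrete, here is a minimal sketch of record and replay at a boundary. It is not Replay Solutions' implementation, which is proprietary and which I have not seen. The names BoundaryRecorder, ReplaySketch, and process are hypothetical, invented for illustration. The assumption is simply that if every nondeterministic input crossing the boundary is logged, feeding the same log back in reproduces the same behavior.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

/** Hypothetical boundary wrapper: logs live inputs, or feeds a recorded log back in. */
final class BoundaryRecorder<T> implements Supplier<T> {
    private final Supplier<T> live;       // real input source; null in replay mode
    private final Deque<T> replayQueue;   // recorded inputs; null in record mode
    private final List<T> log = new ArrayList<>();

    private BoundaryRecorder(Supplier<T> live, List<T> recorded) {
        this.live = live;
        if (recorded == null) {
            this.replayQueue = null;
        } else {
            this.replayQueue = new ArrayDeque<>(recorded);
        }
    }

    static <T> BoundaryRecorder<T> recording(Supplier<T> live) {
        return new BoundaryRecorder<>(live, null);
    }

    static <T> BoundaryRecorder<T> replaying(List<T> recorded) {
        return new BoundaryRecorder<>(null, recorded);
    }

    @Override
    public T get() {
        if (replayQueue != null) {
            // Replay mode: hand back the next recorded input.
            // Throws NoSuchElementException if the log is exhausted.
            return replayQueue.remove();
        }
        // Record mode: pass the live input through and remember it.
        T value = live.get();
        log.add(value);
        return value;
    }

    /** Returns a copy of everything recorded so far. */
    List<T> log() {
        return new ArrayList<>(log);
    }
}

final class ReplaySketch {
    /** Stand-in for application logic; it only sees inputs through the boundary. */
    static int process(Supplier<Integer> input) {
        int sum = 0;
        for (int i = 0; i < 5; i++) {
            sum += input.get();
        }
        return sum;
    }

    public static void main(String[] args) {
        Random random = new Random(); // unseeded, so each live run differs
        BoundaryRecorder<Integer> recorder =
                BoundaryRecorder.recording(() -> random.nextInt(100));
        int original = process(recorder);
        int replayed = process(BoundaryRecorder.replaying(recorder.log()));
        System.out.println(original == replayed); // prints "true"
    }
}
```

The live run is different every time, yet the replayed run always matches the recording. A real tool would also have to capture thread scheduling, clocks, and network traffic, which this sketch ignores.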
It works this magic by tapping into and instrumenting the Java code. This sounds a lot like what my old alma mater Pure Software did with their memory leak detection. What’s nice about it is that no access to source code is required. In the demo, Jonathan fired up an app server (they support Tomcat and JBoss, and soon WebLogic), lit up their instrumentation module, and from that point on just used the software being tested normally. Of course, in the demo, using the software “normally” eventually led to a crash. It was the classic ugly Java stack dump that tells you very little about what actually happened, just the thing to annoy both the user and the developers.
Replay Solutions to the rescue. Jonathan likes to think of it as “TiVo for Software.” The tool shows a screenshot of every HTML page rendered to the screen. This makes it easy to tell where in the recording you are and what the user was doing at the time. The developer can set breakpoints in their code and then use ReplayDIRECTOR (that’s what the software is called) to bring the program up to the point of failure. This can be done over and over until the programmer has figured out what went wrong.
Sounds cool, but why is this software an essential tool for the Multicore Cloud Computing Era? Think about it. In the old days, reproducing bugs was hard enough. It could take days to find the exact set of steps needed to make a bug reproducible. And until the bug is reproducible, it’s nearly impossible to fix. Now fast forward to the Multicore Cloud Computing Era. You’ve got hundreds or even thousands of simultaneous users running against a hundred or more CPUs. There are many, many processes running. Developers recognize this as a nightmare situation, because it becomes impossible to reproduce bugs in such a world. How would you ever get all of those users to do exactly the same thing twice? Add to that all the other crazy timing-related issues and it’s darned near impossible to track down many kinds of bugs on such software.
I talked over what I thought was a really cool scenario with Jonathan. Would it be possible to set up ReplayDIRECTOR to continuously monitor a big SaaS or Web 2.0 system? The answer, surprisingly, is that it is completely possible. Suddenly, we can make these kinds of bugs reproducible. But it gets even better. ReplayDIRECTOR will reproduce the problem on far less hardware than the original system. That’s another big issue to be faced with such systems: the cost of providing a duplicate environment for testing. With Replay, the “black box” can be just the J2EE server. All of the other pieces are simulated.
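Here is one way to picture how the other pieces could be simulated. This is only a sketch under my own assumptions, not how ReplayDIRECTOR does it. It uses a standard JDK dynamic proxy (java.lang.reflect.Proxy) rather than the bytecode-level instrumentation a real tool would more likely need. PriceService, DependencySimulator, and SimulationSketch are hypothetical names. The idea is to record what an external dependency returned during the live run, then answer from that log later with no backend present.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical external dependency, such as a database or remote service. */
interface PriceService {
    int priceOf(String item);
}

final class DependencySimulator {
    /** Wraps the real dependency and appends every return value to the log. */
    @SuppressWarnings("unchecked")
    static <T> T record(Class<T> type, T real, List<Object> log) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Simplification: exceptions thrown by the real object are not unwrapped.
            Object result = method.invoke(real, args);
            log.add(result);
            return result;
        };
        return (T) Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] {type}, handler);
    }

    /**
     * Builds a stand-in that answers from the log, in order, with no real backend.
     * Assumes every recorded value is non-null and calls arrive in the same order.
     */
    @SuppressWarnings("unchecked")
    static <T> T simulate(Class<T> type, List<Object> log) {
        Deque<Object> remaining = new ArrayDeque<>(log);
        InvocationHandler handler = (proxy, method, args) -> remaining.remove();
        return (T) Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] {type}, handler);
    }
}

final class SimulationSketch {
    /** Stand-in for code inside the black box that calls out to a dependency. */
    static int total(PriceService prices) {
        return prices.priceOf("apple") + prices.priceOf("pear");
    }

    public static void main(String[] args) {
        PriceService real = item -> item.length() * 10; // pretend this is an expensive backend
        List<Object> log = new ArrayList<>();
        int live = total(DependencySimulator.record(PriceService.class, real, log));
        int replayed = total(DependencySimulator.simulate(PriceService.class, log));
        System.out.println(live + " " + replayed); // prints "90 90"
    }
}
```

The second call to total never touches the real service, which is the point: the replay environment only needs the code under test plus the recorded log, not a copy of every system it talks to.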
If I were currently involved with a J2EE-architecture piece of Enterprise Software, I would definitely be trying to get into Replay’s Beta Testing program.