Big Data is all the rage, and seems to be one of the prime targets for new entrepreneurial ventures now that VC attention has started to shift from the Consumer Internet to the Enterprise. Yet I remain skeptical about Big Data for a variety of reasons. As I’ve noted before, it seems to be a premature optimization for most companies. That post angered the Digerati, who are quite taken with their NoSQL shiny objects, but others since have reached much the same conclusion. The truth is, Moore’s Law scales faster than most organizations can scale their creation of data. Yes, a few companies out of millions are large enough to really need Big Data, and yes, it is so fashionable right now that many who don’t need it will be talking about it and using it just to be part of the new new thing. But they’re risking the problems that befall those who adopt the new new thing for fashion rather than because it solves real problems they have.
This post is not really about Big Data, other than to point out that I think it is a relatively small market in the end. It’ll go the way of Object Oriented Databases by launching some helpful new ideas, the best of which will be adopted by the entrenched vendors before the OODB companies can reach interesting scales. So it will be with Hadoop, NoSQL, and the rest of the Big Data Mafia. For those who want to get a head start on the next wave, and on a wave that is destined to be much more horizontal, much larger, and of much greater appeal, I offer the notion of Suburban Data.
While I shudder at the thought of coining any new buzzword, Suburban Data is the name I’ve come up with while thinking about massively parallel architectures that are so loosely coupled (or perhaps not coupled at all) that they don’t need to deal with many of the hard consistency problems of Big Data. They don’t care, because they are architectures optimized to create a Suburb of very loosely coordinated and relatively small collections of data. Think of Big Data’s problems as those of the inner city, where there is tremendous congestion, real estate is extremely expensive, and it makes sense to build up, not out. Think Manhattan. It’s very sexy and a wonderful place to visit, but a lot of us wouldn’t want to live there. Suburban Data, on the other hand, is all about the suburbs. Instead of building giant apartment buildings where everyone lives in very close proximity, Suburban Data is about maximizing the potential of detached single-family dwellings. It’s decentralized, and there is no need for excruciatingly difficult parallel algorithms to ration scarce services and enforce consistency across terabytes.
Let’s consider a few Real World application examples.
WordPress.com is a great place to start. It consists of many instances of WordPress blogs, and anyone who likes can get one for free. I have several, including this Smoothspan Blog. Most of the functionality offered by wp.com does not have to coordinate between individual blogs. Rather, it’s all about administering a very large number of blogs that individually make very modest demands on the underlying architecture. Yes, there are some features that are coordinated, but the vast majority of the functionality, and the functionality I tend to use, is not. If the WordPress.com example resonates, web site hosting services are another obvious one. They just want to hand out instances as cheaply as possible. Every blog or website is its own single-family home.
There are a lot of examples along these lines in the Internet world. Any offering where the need to communicate and coordinate between different tenants is minimized is a good candidate. Another huge area of opportunity for Suburban Data is SaaS companies of all kinds. Unless a SaaS company is exclusively focused on extremely large customers, the requirements of an average SaaS instance in the multi-tenant architecture are modest. What customers want is precisely the detached single-family dwelling, at least from a User Experience perspective. Given that SaaS is the new way of the world, and even a solo bootstrapper can create a successful SaaS offering, this is truly a huge market. The potential here is staggering, because this is the commodity market.
Look at the major paradigm shifts that have come before, and most amount to a metaphorically similar transition. We went from huge centralized mainframes to mini-computers. We went from mini-computers to PCs. Many argue we’re in the midst of going from PCs to Mobile. Suburban Data is all about how to create architectures that are optimal for creating Suburbs of users.
What might such architectures look like?
First, I think it is safe to say that while existing technologies such as virtualization and the growing number of server hardware architectures optimized for data center use (Facebook and Google have proprietary hardware architectures for their servers) are a start, a lot more is possible and the job has hardly begun. Becoming the next Oracle of this space will take a completely clean-sheet design from top to bottom. I’m not going to map the architecture out in great detail, because it’s early days and frankly I don’t know all the details. But let’s Blue Sky a bit.
Imagine an architecture that puts at least 128 x86-compatible cores (we need a commodity instruction set for our Suburbs), along with all the RAM and Flash Disc storage they need, onto the equivalent of a memory stick for today’s desktop PCs. Because power and cooling are two of the biggest challenges in modern data centers, the Core Stick will use the most miserly architectures possible: we want a lot of cores with reasonable but not extravagant clock speeds. Think per-core power consumption suitable for Mobile Devices more than desktops. For software, let’s imagine these cores run an OS Kernel built from the ground up around virtualization and the needs of Suburban Data. Further, there is a service layer running on top of the OS that’s also optimized for the Suburban Data world but has the basics all ready to go: an Apache Web Server and MySQL. In short, you have 128 Amazon EC2 instances, each potent enough to run 90% of the web sites on the Internet. Now let’s create backplanes that fit a typical 19″ rack setup, with all the right UPS and DC power capabilities the big data centers already know how to do well. The name of the game will be Core Density. We get 128 cores on a memory stick, and let’s say 128 sticks in a 1U rack mount, so we can support 16K web instances in one of those rack mounts.
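The density arithmetic above compounds quickly. A back-of-the-envelope sketch, using the hypothetical figures from this thought experiment (none of this is real hardware):

```python
# Back-of-the-envelope core density for the hypothetical "Core Stick" rack.
# All figures are illustrative assumptions from the sketch above.

CORES_PER_STICK = 128   # x86-compatible cores on one memory-stick-sized board
STICKS_PER_1U = 128     # Core Sticks per 1U rack-mount backplane
RACK_UNITS = 42         # a typical full-height 19" rack

instances_per_1u = CORES_PER_STICK * STICKS_PER_1U
instances_per_rack = instances_per_1u * RACK_UNITS

print(f"Instances per 1U:   {instances_per_1u:,}")    # 16,384 (~16K)
print(f"Instances per rack: {instances_per_rack:,}")  # 688,128
```

At one web site per instance, a single rack of such hardware would host on the order of two-thirds of a million detached single-family dwellings.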
There will be many valuable problems to solve with such architectures, and hence many opportunities for new players to make money. Consider what has to be done to reinvent hierarchical storage management for such architectures. We’ve got a local Flash disc with each core, but it is probably relatively small. Hence we need access to storage on a hierarchical basis, so we can consume as much as we want and it seamlessly works. Or consider communicating with and managing the cores. The only connections to the Core Stick should be very high speed Ethernet and power. Perhaps we’ll want some out-of-band control signals for security’s sake as well. Want to talk to one of these little gems? Just fire up the browser and connect to its IP address. BTW, we probably want full software net fabric capabilities on the stick.
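To make the hierarchical storage idea concrete, here is a minimal sketch of the behavior described above: a small local Flash tier that transparently spills cold data to a larger shared network tier and promotes it back on access. The class, tier names, and sizes are all hypothetical illustrations, not a proposed design:

```python
# Minimal sketch of two-tier hierarchical storage for a Core Stick:
# a small local flash tier that spills cold data to a shared network tier.
# Names and capacities are hypothetical; eviction is FIFO for simplicity.

class TieredStore:
    def __init__(self, local_capacity=4):
        self.local_capacity = local_capacity  # blocks of fast local flash
        self.local = {}    # hot tier: the core's own flash
        self.network = {}  # cold tier: shared hierarchical storage

    def put(self, key, value):
        self.local[key] = value
        if len(self.local) > self.local_capacity:
            # Spill the oldest local entry out to the network tier.
            oldest = next(iter(self.local))
            self.network[oldest] = self.local.pop(oldest)

    def get(self, key):
        if key in self.local:
            return self.local[key]
        # Miss: promote the block from the network tier back into flash.
        value = self.network.pop(key)
        self.put(key, value)
        return value

store = TieredStore(local_capacity=2)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)          # "a" silently spills to the network tier
print(store.get("a"))      # still works: promoted back into local flash
```

From the caller’s point of view there is just `put` and `get`; the rationing of scarce local flash happens behind the scenes, which is exactly the “consume as much as we want and it seamlessly works” property.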
It’ll take quite a while to design, build, and mature such architectures. That’s fine, it’ll give us several more Moore cycles in which to cement the inevitability of these architectures.
You see what I mean when I say this is a whole new ballgame and a much bigger market than Big Data? It goes much deeper and will wind up being the fabric of the Internet and Cloud of tomorrow.