Why Don’t Search Startups Share Data? (aka Open Source Style Web Crawling)
Posted by Bob Warfield on August 7, 2007
New Jersey Search Engine startup Accoona is filing for IPO after just a few short years of operation. Whether you think they’re a good investment or not (http://www.alleyinsider.com/2007/08/analyzing-accoo.html), there’s definitely some feeling there’s gold left in them thar hills and the Googleplex hasn’t taken it all yet.
At the same time, there is an interesting discussion on Skrentablog that asks: if there are 100 alternate search engines, how come only about 11 seem to be crawling the net? Accoona, incidentally, is one of the 11 actually detected. Richard MacManus is similarly perplexed, and his ReadWriteWeb blog post considers some possible explanations, but winds up baffled.
I appreciate the mystery. Are these niche search engines that just haven’t hit all the sites? Are they purchasing their index from somewhere? Do they know some way besides crawling to create an index? Are they really smart about not crawling until a page changes somehow? And does this deafening silence from crawlers indicate that the Googles and Yahoos of the world really have sewn up the search world, that you have to be big to do it, and that others need not apply?
I’m reminded of a novel web crawler I was involved with at a startup I founded called iMiner/PriceRadar. We had a nifty data mining application that tapped into eBay to optimize listings. During our early days, we had to crawl eBay and collect all the data. We discovered early on that their denial-of-service deflector shields were blocking our IP before we could gather much data. What to do? In those days, DSL was just coming on stream, so we bought every employee a DSL connection and built a distributed spider that ran from their homes. They’d install a little applet on their PC, and we collected all the data we wanted because the load was spread over enough IPs that it didn’t trigger eBay’s ire. Meanwhile, the employees got a great perk. In effect, nobody could see we were crawling the web (or eBay in this case) because our presence was too diffuse.
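The heart of that distributed spider is just a deterministic way to split the URL space across many home machines so no single IP fetches enough to trip a rate limiter. Here is a minimal sketch of that partitioning idea; the URLs, worker count, and function names are illustrative, not the actual PriceRadar code:

```python
import hashlib

def assign_worker(url: str, num_workers: int) -> int:
    """Deterministically assign a URL to one of num_workers home-DSL
    spiders by hashing it. Each worker fetches only its own share, so
    the request volume from any single IP stays small and diffuse."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers

# Illustrative: spread a batch of listing URLs over 20 employee machines.
urls = [f"https://example.com/listing/{i}" for i in range(100)]
shards = {}
for u in urls:
    shards.setdefault(assign_worker(u, 20), []).append(u)
```

Because the assignment is a pure function of the URL, every applet can compute its own share locally with no central coordinator handing out work.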
Which brings me to my point about all this web crawling: why aren’t startups sharing the burden? There are enough of them with enough different spins that they should be able to divide and conquer the web, much as my old distributed algorithm would divide and conquer eBay. The necessary machinery could simply be Google’s MapReduce or its Open Source implementation, Hadoop. These frameworks are tailor-made for distributed crawlers. One could even set things up so that the startups could participate to the extent of their needs and funding and would thereby receive only a subset of the collective work. Come to that, why doesn’t Google or Yahoo sell “Remora Slots” on their own crawlers? These would be piggyback opportunities to filter the text as it is gathered and build up custom indices of one kind or another. It’s the Amazon Web Services equivalent for a big Search provider to offer. They could charge so many cents a terabyte and offer the ability to filter what gets passed on for niche services that don’t need the whole web. Now that would be an interesting service!
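The Remora-slot idea maps naturally onto MapReduce’s two phases: the map step runs each subscriber’s filter over pages as they’re fetched, and the reduce step groups the surviving postings into that subscriber’s niche index. A toy sketch of that flow, with made-up pages and subscription terms standing in for a real crawl feed:

```python
from collections import defaultdict

# Hypothetical corpus: (url, page_text) pairs a shared crawler hands back.
pages = [
    ("http://a.example/1", "open source web crawling at scale"),
    ("http://b.example/2", "niche search engines and web crawling"),
    ("http://c.example/3", "cooking recipes for weeknight dinners"),
]

def map_phase(url, text, subscribed):
    """Emit (term, url) pairs, but only for terms the niche service
    subscribed to -- the 'Remora slot' filter applied at crawl time."""
    for term in text.split():
        if term in subscribed:
            yield term, url

def reduce_phase(pairs):
    """Group postings by term into a small inverted index."""
    index = defaultdict(list)
    for term, url in pairs:
        index[term].append(url)
    return dict(index)

subscribed = {"crawling", "search"}
postings = [p for url, text in pages for p in map_phase(url, text, subscribed)]
index = reduce_phase(postings)
# The cooking page never reaches the niche service's index at all.
```

The point of filtering in the map phase is that the niche subscriber pays to move and store only the slice of the web it cares about, not the whole crawl.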
The same principle should be at work throughout a lot of startup activities. Why don’t startups collaborate more frequently? We’ve demonstrated that code collaboration à la Open Source has tremendous advantages. The next step is collaboration on data of various kinds. Many would say the data is even more important than the code (see Open Source and Scratching Itches in the Cloud). Combine code and data and there isn’t a lot else. Combining the data likely requires the right sort of platform and APIs to make it happen, but it would be quite an innovation if it came to pass, wouldn’t it?
It’s been a long time since the idea of “keiretsu” was in favor.