The Science of SPAM (Hint: It’s an Arms Race | And: A Better Mahalo)
Posted by Bob Warfield on October 10, 2007
I read an interesting article on the Grey Hat SEO blog about some of the techniques SEO spammers use. Before I go further, let me be absolutely crystal clear: I. Hate. Spam!
But it is interesting, and sometimes quite useful, to know your foe. So it is with this Grey Hat SEO article. It answered a question I’d wondered about: Why do spammers send email or construct web pages that have seemingly random collections of words on them? Yes, they may trick a search engine into going there, but it seems that real content would be more likely to cause someone to actually stay there and do something monetizable. It all seemed like a colossally misguided waste of effort that needlessly annoys people. But there is a method to this madness, as it turns out.
Those machine-generated spammy messages are the equivalent of Star Wars Imperial Probe Droids. They are performing reconnaissance prior to sending in heavier forces. Here is how it works. The spammers take candidate lists of terms, combine them in sophisticated ways, and then check how far up the search engine results page (what they call a “SERP”) each combination gets them. When they reach the highest possible level using these techniques, they have identified a chink in the defenses that becomes the starting point for the next level of effort.
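To make the probing loop concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the term list, the `TERM_STRENGTH` weights, and `simulated_position`, which stands in for "publish a page and see where it lands." Nothing here talks to a real search engine, and the article doesn't describe the spammers' actual tooling.

```python
from itertools import combinations

# Hypothetical candidate terms with made-up "strength" weights.
TERM_STRENGTH = {"cheap": 5, "widgets": 12, "blue": 3, "discount": 8}
CANDIDATES = list(TERM_STRENGTH)


def simulated_position(terms):
    """Toy stand-in for checking a probe page's SERP position.

    Position 1 is the top of the results page. A real probe would
    observe this from outside; here it is an invented formula.
    """
    return max(1, 40 - sum(TERM_STRENGTH[t] for t in terms))


def best_combination(candidates, size):
    """Try every combination of `size` terms and keep the best-placed one."""
    return min(combinations(candidates, size), key=simulated_position)


winner = best_combination(CANDIDATES, 2)
print(winner, simulated_position(winner))
```

The point is only the shape of the loop: enumerate combinations, measure each one, keep the winner as the seed for the next round.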
That next level involves creating real content around those keyword combinations that will pass muster by humans and presumably be a little stickier for those that land there. From that base, they then look to create as many linkages as possible to the page to get it even higher in the search results based on the PageRank algorithm.
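Why do more inbound links help? A minimal sketch of the textbook PageRank iteration shows the effect. This is the simplified published form of the algorithm, not Google's production ranking, and the page names, the damping factor of 0.85, and the iteration count are my assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    `links` maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical tiny web: "target" is the spammer's page.
before = {"a": ["b"], "b": ["a"], "target": ["a"]}
after = {"a": ["b", "target"], "b": ["a", "target"], "target": ["a"]}

print(pagerank(before)["target"], pagerank(after)["target"])
```

In the `before` graph nobody links to `target`, so it only gets the baseline share. Once `a` and `b` link to it, its rank climbs, which is exactly why the spammers go hunting for linkages.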
Amazingly ingenious and methodical, isn’t it? Who would have thunk it?
Now here is the next piece of the puzzle: it’s an arms race. If you have a web site and monitor how many Google hits there are on, say, your company name, you will notice there is a tremendous ebb and flow. “SmoothSpan”, as I write this, fluctuates from about 5,000 hits all the way up to 20,000 hits. My first reaction on seeing that was to wonder why people were adding and removing references on so many sites so often. It didn’t make sense. For a little while, watching the daily behaviour, I wondered whether it reflected a common trait of massively scaled sites like Google, which wind up emphasizing availability over consistency. Perhaps my search was being handled by some nodes that just didn’t have all the info and couldn’t return a full set of links.
Eventually I went out looking at some of these and discovered, lo and behold, tons of machine-generated pages. I now realize that many were spammer probe droids. Others were odd artifacts of various web sites. For example, lots of sites seem to create pages associated with each tag that have links relating to the tag. These pages change constantly depending on what you write about on your blog. It’s hard for me to see finding them in Google as a very useful search result, but I presume reaching them as part of another application is viewed as a good thing. It’s always intriguing to find parts of the machine exposed for general viewing.
The reason the hit counts fluctuate so much is that Google is constantly adding heuristics to try to eliminate these probe droid pages. It’s literally an arms race. We read recently about how they’re penalizing sites that sell links, for example. If you wind up on that list, your search rank will be permanently lowered. If that’s not open warfare, I don’t know what is!
FWIW, I continue to find that searching the blogosphere using Google Blog Search is a better starting point for most of my searches than searching the whole web. Try it some time. It’s an easy habit to pick up and it really works well. I think of it as the poor man’s Mahalo. Why would I need Mahalo to pre-process my search with humans when there are so many bloggers doing it already?
Perhaps this is the answer to Scoble’s lament that great content is now a commodity. He also talks about how hard it is to get a lot of link juice in the blogosphere. Scoble takes all this and translates it to boredom with blogging and the difficulties of getting ahead in the blog world. But ask yourself how you would expect things to behave if the content in the blogosphere were of radically higher quality on average than that of the great unwashed web?
I think it explains the situation pretty well. Information friction is low, quality is very high. I know from my own experience of searching the blogosphere first that I am much more productive. So what does that say for link juice? That it will be less a function of quality (which seems to be common in the blogosphere) and more a function of whether you’re talking about what people are interested in at the moment. That’s why the services being discussed are creating such spikes–they are transient interest points.
Getting back to our original theme, it must be perplexing to the SEO world. Eventually they will target the blog world. Anyone who has a blog sees them trying, yet it is hard for various reasons. They haven’t yet nailed the science of spam in this world. I hope it takes a long time yet before they do!
It’s not about the eyeballs, it’s the brains. This seems to fit with what I’m saying about high quality content in the blogosphere and spikes of interest that aren’t deep.