One day your traffic comes to a grinding halt. What happened? Check the index. Google may have found all your reciprocal links from i-hump-sheep.info and white-castle-coupons.biz. But it's also possible that you have been "proxy hacked." That's the term being tossed around by a few people who have kept mum on it for a while (Alan Perkins, Danny Sullivan, Bill Atchison, Brad Fallon), and by a few other people who are actually exploiting this hole right now (and whom we don't know).

And it's likely the reason Google (http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html), Yahoo (http://www.ysearchblog.com/archives/000460.html), MSN (http://blogs.msdn.com/livesearch/archive/2006/11/29/search-robots-in-disguise.aspx), and Ask (http://about.ask.com/en/docs/about/webmasters.shtml#21) all published guidelines to detect whether a bot is, indeed, an authentic bot. I was wondering about that when I heard about it. Now it all comes together.
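
For the curious, the verification those posts describe boils down to a double DNS lookup: reverse-resolve the requesting IP, check that the hostname belongs to the engine's domain, then forward-resolve that hostname and confirm it points back at the same IP. Here's a minimal PHP sketch of the idea for Googlebot only; the function name is mine, and the other engines' posts list the domains to accept for their own crawlers.

<?php
// Minimal sketch (mine, not Google's code): verify that a visitor claiming
// to be Googlebot really is, using the double DNS lookup Google describes.
function is_real_googlebot($ip)
{
    // Reverse lookup: gethostbyaddr() returns the IP unchanged on failure.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;
    }
    // The hostname must end in googlebot.com or google.com.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }
    // Forward lookup must resolve back to the original IP.
    return gethostbyname($host) === $ip;
}

Yahoo, MSN, and Ask document the same pattern with their own crawler domains, so the check generalizes by swapping in the appropriate domain suffix.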

So what's going on? Dan Thies has a summary up about it, and he'll probably do a much better job explaining it, since he's not a programmer at heart like I am; I'd just end up speaking in pseudocode. So read Dan's summary over here. Here's a tidbit that explains much of it:

With the introduction of "Big Daddy," Google crawls from many different data centers, and it also changed the algorithm substantially at the same time. According to Dan, "It appears that the changes include moving some of the duplicate content detection down to the crawlers. [This is problematic. In short:]

1. The original page exists in at least some of the data centers.
2. A copy (proxy) gets indexed in one data center, and that gets sync'd across to the others.
3. A spider visits the original, checks to see if the content is duplicate, and erroneously decides that it is.
4. The original is dropped or penalized."
So the problem is this: flooding Google with massive amounts of duplicate content exposes a vulnerability. Eventually the algorithm makes a mistake, and your content is no longer treated as the authoritative copy.

Oops!

How to Fight Back: Code Implementations

Well, that's where I come in. I have two implementations in beta (read: they work according to my tests, but I'll keep testing) that address the problem using the bot-verification methods the search engines cite above. Essentially, we're using a benign form of cloaking (yes, cloaking!) to make it harder for bad bots, proxies, etc. to exploit us.

The code is located here.

I'll expand the explanation in that documentation to make it easier to understand and install. But if you know PHP, dive right in; a simplified sketch of the idea follows.
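
To give a flavor of what such a cloak might look like, here's my own simplified sketch (not the actual code linked above): if a request's User-Agent claims to be Googlebot but the IP doesn't verify, refuse it. A proxy relaying Googlebot's User-Agent from its own IP then gets a 403 instead of a copy of your page. It assumes the is_real_googlebot() function sketched earlier; extending it to the other engines is just a matter of adding their User-Agents and verification domains.

<?php
// Simplified sketch of the "benign cloaking" idea -- not the production code.
// Assumes is_real_googlebot() from the earlier sketch is available.
require_once 'googlebot_check.php';   // hypothetical include holding that function

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = $_SERVER['REMOTE_ADDR'];

// A request that *says* it's Googlebot but doesn't pass the DNS check is
// almost certainly a proxy or a scraper in disguise -- turn it away.
if (stripos($ua, 'googlebot') !== false && !is_real_googlebot($ip)) {
    header('HTTP/1.1 403 Forbidden');
    exit('403 Forbidden');
}

// ...otherwise, serve the page normally.

Legitimate crawlers always pass the check, so nothing is hidden from them; only impostors see different content, which is why I call it a benign form of cloaking.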

The code and concepts are primarily based on "Search Engine Optimization with PHP," the book I coauthored with Cristian Darie. It's my sentiment that most SEOs need to be more aware of technology than they think they do; hence the book. This is just one example.

You can see it in action here:
http://www.seoegghead.com/tools/test-simple-cloak.php
