One day your traffic comes to a grinding halt. What happened? Check the index. Google may have found all your reciprocal links from i-hump-sheep.info and white-castle-coupons.biz. But it's also possible that you have been "proxy hacked." That's the term being tossed around by a few people who have been mum on it for a while -- Alan Perkins, Danny Sullivan, Bill Atchison, and Brad Fallon -- and by a few other people who are actually exploiting this hole right now (and whom we don't know).
And it's likely the reason Google (http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html), Yahoo (http://www.ysearchblog.com/archives/000460.html), MSN (http://blogs.msdn.com/livesearch/archive/2006/11/29/search-robots-in-disguise.aspx), and Ask (http://about.ask.com/en/docs/about/webmasters.shtml#21) all published guidelines to detect whether a bot is, indeed, an authentic bot. I was wondering about that when I heard about it. Now it all comes together.
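For the curious, here is roughly what that kind of verification looks like: do a reverse DNS lookup on the visiting IP, check that the hostname belongs to the search engine, then do a forward lookup on that hostname and confirm it maps back to the same IP. This is only a minimal sketch in Python, not the PHP implementation discussed later. The function names and the list of hostname suffixes are assumptions made for illustration, so check each engine's own guidelines for the domains it actually uses.

```python
import socket

# Illustrative suffixes only (an assumption); consult each engine's guidelines.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward_lookup(hostname):
    """Return the IPv4 addresses for hostname (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_bot(ip, suffixes=ALLOWED_SUFFIXES,
                    reverse=_reverse_lookup, forward=_forward_lookup):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    The reverse and forward parameters are hypothetical hooks so the
    lookups can be swapped out (for caching or for testing).
    """
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    # The hostname must END with an allowed suffix; a name such as
    # crawl.googlebot.com.evil.example must not pass.
    if not hostname.endswith(tuple(suffixes)):
        return False
    # Anyone can set a PTR record, so confirm the name resolves back to the IP.
    return ip in forward(hostname)
```

The user-agent string never enters into it, because anyone can fake that. DNS lookups are slow, so a real site would want to cache the result per IP.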
So what's going on? Dan Thies has a summary up about it, and he'll probably do a much better job explaining it since he's not a programmer at heart like me. I'll end up speaking in pseudocode ... so read Dan's summary over here. Here's a tidbit that explains a lot of it:
With the introduction of "Big Daddy," Google crawls from many different data centers; they also changed the algorithm substantially at the same time. According to Dan, "It appears that the changes include moving some of the duplicate content detection down to the crawlers. [This is problematic. In short:]
1. The original page exists in at least some of the data centers.
2. A copy (proxy) gets indexed in one data center, and that gets sync'd across to the others.
3. A spider visits the original, checks to see if the content is duplicate, and erroneously decides that it is.
4. The original is dropped or penalized."
So ... the problem is that if you flood Google with massive amounts of duplicate content, it exposes a vulnerability. Eventually the algorithm makes a mistake, and your content is no longer authoritative.
Oops!
How To Fight Back -- Code implementations
Well, that's where I come in. I have two implementations in beta (read: they work according to my tests, but I'm going to be testing more) that address the problem using the bot-verification methods the search engines cite. In essence, we're using a benign form of cloaking (yes, cloaking!) to make it more difficult for bad bots, proxies, etc. to exploit us.
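To give a feel for the idea, here is one way such benign cloaking could work: serve an indexable page only to visitors verified as real search engine bots, and give everyone else, proxies included, a robots meta tag that says noindex. A copy fetched through a proxy then tells the engines not to index it. This is a sketch under that assumption, written in Python for illustration; it is not the actual implementation, and the names robots_meta, render_page, and the verify callback are hypothetical.

```python
import html

INDEXABLE = '<meta name="robots" content="index, follow">'
NOT_INDEXABLE = '<meta name="robots" content="noindex, nofollow">'


def robots_meta(ip, verify):
    """Pick the robots meta tag for a visitor.

    verify is a hypothetical callback that returns True only when the IP
    has been confirmed as a genuine search engine bot (for example via a
    reverse and forward DNS check).
    """
    return INDEXABLE if verify(ip) else NOT_INDEXABLE


def render_page(title, body_html, ip, verify):
    """Render a minimal page whose robots meta tag depends on the visitor."""
    return (
        "<html><head><title>{}</title>{}</head>"
        "<body>{}</body></html>"
    ).format(html.escape(title), robots_meta(ip, verify), body_html)


if __name__ == "__main__":
    # Stand-in verifier for the demo: treats one address as a verified bot.
    demo_verify = lambda ip: ip == "66.249.66.1"
    print(render_page("Home", "<p>Hello</p>", "66.249.66.1", demo_verify))
    print(render_page("Home", "<p>Hello</p>", "203.0.113.7", demo_verify))
```

Human visitors see the same content either way; only the robots meta tag differs, which is why I call the cloaking benign.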
The code is located here
I'll expand the explanation in that documentation to make it easier to comprehend/install. But if you know PHP, dive right in.
The code and concepts are based primarily on the book I coauthored with Cristian Darie, "Search Engine Optimization with PHP." It is my sentiment that most SEOs need to be more aware of technology than they think -- hence the book. This is just one example.
You can see it in action here:
http://www.seoegghead.com/tools/test-simple-cloak.php

August 17th, 2007 at 6:02 am
It's a very interesting article. When you think of the whole proxy thing, it kind of makes sense that it could cause content to be duplicated. Google really needs to get this sorted out ASAP.
Let's hope that bringing this proxy problem out in the open doesn't cause too much damage.
Anyway, great site! Your book is currently in the mail heading my way!
August 17th, 2007 at 3:06 pm
[...] it’s easy to do so you certainly need to check it out. SEO Egghead has the solution on how to defend against proxy hacking which is heavy into PHP but doesn’t look too hard to implement. It’s sad the SEO [...]
August 18th, 2007 at 5:58 pm
I don't really understand this. Could somebody please explain it to me? I am not a native English speaker.
Please email dateameperu at yahoo dot com
August 21st, 2007 at 12:14 pm
[...] Implementation Guide is available on Jaime Sirovich’s blog to walk you through some possible preventative [...]
August 21st, 2007 at 10:15 pm
[...] Google Proxy Hacking: This is the first time I’ve heard of Google Proxy Hacking so I thought I would post some info: What is it? It’s a method of using Google’s “remove duplicate content” feature to get sites removed from Google’s index or penalized. Find out more and steps to prevent it. [...]
August 23rd, 2007 at 3:45 pm
[...] is also an SEO worth reading (though he has been taking some R&R due to illness) and wrote the code to fix the proxy [...]
August 25th, 2007 at 11:45 am
[...] duplicate content” feature to get sites removed from Google’s index or penalized. Find out more and steps to prevent it. Tags: ecommerce web hosting, web hosting solution, free domain name, linux web hosting, free web [...]
August 29th, 2007 at 2:53 pm
Very interesting article. I'm going to have to do some more research into this. Google says there's no duplicate content penalty, but they also say landing in the supplementals isn't a bad thing. I would hate for that to happen purely because two bots crawl the same page and naturally see the same thing...!
August 31st, 2007 at 3:12 am
Hi Jaimie, Dan Thies pointed me in your direction, as I have several websites that all run on a Windows server and primarily use HTM, with some pages being ASP. However, one of my websites is built entirely in ASP.
Dan mentions in his blog that you may have been working on a fix for those of us dirty enough to use ASP and windows servers. I just wanted to see if this was still the case and if you have any updates on this yet.
Thanks for your efforts so far. As someone who isn't all that savvy in web development and manages to just get by, this is a real help.
Kirk
September 2nd, 2007 at 5:12 am
Hi All,
My site URL has been hacked! Normally my site shows up when you type "sms4niets" in a Google search. Now if you type "sms4niets" in Google, it redirects to a completely different site (not mine!). I'm not very good at PHP, but could somebody help me implement a solution? Please, I need some help on this subject. Thanks!
Erwin dus (@) casema.nl
September 4th, 2007 at 7:31 am
[...] How To Guide: Prevent Google Proxy Hacking [...]
September 5th, 2007 at 12:15 am
I'm with Kirk,
We have a mix of ASP and HTML and have been fighting this and scrapers for the last couple of years. We really need help. Please keep me posted on your IIS solution.
September 9th, 2007 at 11:50 am
I wrote a small article and some scripts that might come in handy when fighting this including a small explanation.
Dedicated to the PHP coders!
September 26th, 2007 at 7:00 am
[...] duplicate content” feature to get sites removed from Google’s index or penalized. Find out more and steps to prevent it. Tags: ecommerce, web hosting provider, seo tool, web hosting company, ecommerce shopping cart, seo [...]
September 28th, 2007 at 9:39 pm
[...] to get the code: An implementation guide is provided on Jaimie’s blog, along with a testing environment that you can use to check [...]
October 24th, 2007 at 8:41 am
I've read Dan's post too and wondered how he implemented his proxy-hack solution. Thanks, Jaimie, for sharing yours.