Aug 16

How To Guide: Prevent Google Proxy Hacking

Posted by Jaimie Sirovich on Aug. 16th, 2007. 23 comments.

One day your traffic comes to a grinding halt. What happened? Check the index. Google may have found all your reciprocal links from i-hump-sheep.info and white-castle-coupons.biz. But it's also possible that you have been "proxy hacked." That's the term being tossed around by a few people who have kept mum on it for a while (Alan Perkins, Danny Sullivan, Bill Atchison, Brad Fallon) and, presumably, by a few other people whom we don't know and who are actually exploiting this hole right now.

And it's likely the reason Google (http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html), Yahoo (http://www.ysearchblog.com/archives/000460.html), MSN (http://blogs.msdn.com/livesearch/archive/2006/11/29/search-robots-in-disguise.aspx), and Ask (http://about.ask.com/en/docs/about/webmasters.shtml#21) all published guidelines for detecting whether a bot is, indeed, an authentic bot. I wondered why when I first heard about those guidelines. Now it all comes together.
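
For the curious, here is a rough sketch of the kind of check the Google guideline linked above describes: reverse-resolve the visitor's IP, make sure the hostname sits in the crawler's domain, then forward-resolve that hostname and confirm it maps back to the same IP. It is written in Python for brevity rather than the PHP my own code uses. The function names and the injectable lookup parameters are made up for illustration, and only the googlebot.com suffix comes from Google's guideline; check each engine's own documentation before adding others.

```python
import socket

# Hostname suffixes treated as trustworthy. Only googlebot.com is taken from
# the Google guideline; add other engines after reading their documentation.
TRUSTED_SUFFIXES = (".googlebot.com",)


def reverse_lookup(ip):
    """Return the PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Return the IPv4 addresses for hostname (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_authentic_bot(ip, suffixes=TRUSTED_SUFFIXES,
                     reverse=reverse_lookup, forward=forward_lookup):
    """Reverse-resolve ip, check the domain, then forward-confirm it."""
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(tuple(suffixes)):
        return False
    # The forward lookup stops someone who controls their own PTR record
    # from simply naming their machine "crawl.googlebot.com".
    return ip in forward(hostname)
```

The lookups are passed in as parameters so the logic can be exercised without touching real DNS. In production you would also want to cache the result per IP, since two DNS round trips on every request get expensive.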

So what's going on? Dan Thies has a summary up about it, and he'll probably do a much better job explaining it since he's not a programmer at heart like me. I'll end up speaking in pseudocode … so read Dan's summary over here. Here's a tidbit that explains a lot of it:

With the introduction of "Big Daddy," Google began crawling from many different data centers, and it changed the algorithm substantially at the same time. According to Dan, "It appears that the changes include moving some of the duplicate content detection down to the crawlers." This is problematic. In short:

1. The original page exists in at least some of the data centers.
2. A copy (proxy) gets indexed in one data center, and that gets sync'd across to the others.
3. A spider visits the original, checks to see if the content is duplicate, and erroneously decides that it is.
4. The original is dropped or penalized.

So … the problem is that if you flood Google with massive amounts of duplicate content, it exposes a vulnerability. Eventually the algorithm makes a mistake, and your content is no longer authoritative.

Oops!

How To Fight Back — Code implementations

Well, that's where I come in. I have two implementations in beta (read: they work according to my tests, but I'm going to test more) that address the problem. Both authenticate bots using the methods the search engines cite. Then, essentially, we use a benign form of cloaking (yes, cloaking!) to make it more difficult for bad bots, proxies, etc. to exploit us.
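
To make the "benign cloaking" idea concrete, here is one way the decision could look. This is a sketch under assumptions, not a transcription of my PHP code: it assumes the goal is to hand a noindex tag to everything that is not an authenticated crawler, so that a proxy's copy of the page carries an instruction not to index it. The `robots_meta` function, the `BOT_HINTS` list, and the `verify` callback are all hypothetical names chosen for the example, and it is in Python for brevity.

```python
# Substrings suggesting a request *claims* to come from a search engine.
# Illustrative only; adjust to the crawlers you care about.
BOT_HINTS = ("googlebot", "slurp", "msnbot", "teoma")


def robots_meta(ip, user_agent, verify):
    """Pick the robots meta tag to emit for one request.

    verify is any callable that takes an IP address and returns True for
    an authenticated crawler (for example a reverse/forward DNS check).
    """
    claims_bot = any(hint in user_agent.lower() for hint in BOT_HINTS)
    if claims_bot and verify(ip):
        # Authentic crawler: let it index the original page.
        return '<meta name="robots" content="index,follow">'
    # Everyone else, including proxies and impostors, gets a copy that
    # tells search engines not to index it if it is republished.
    return '<meta name="robots" content="noindex,nofollow">'
```

Checking the user agent first means the DNS-based verification only runs for requests that claim to be a crawler, which keeps ordinary visitors fast. Human visitors never notice the extra tag, which is why I call this form of cloaking benign.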

The code is located here

I'll expand the explanation in that documentation to make it easier to comprehend/install. But if you know PHP, dive right in.

The code and concepts are based primarily on the book I coauthored with Cristian Darie, "Search Engine Optimization with PHP." It is my sentiment that most SEOs need to be more aware of technology than they think, hence the book. This is just one example.

You can see it in action here:
http://www.seoegghead.com/tools/test-simple-cloak.php





"23 Wise Comments Banged Out Somewhere On The Internet ..."


Steve Goddard

It's a very interesting article. When you think of the whole proxy thing, it kind of makes sense that it could cause content to be duplicated. Google really needs to get this sorted out ASAP.

Let's hope that bringing this proxy problem out in the open doesn't cause too much damage.

Anyway, great site! Your book is currently in the mail, heading my way!

jose peru

I don't really understand this. Could somebody explain it to me, please? I am not a native English speaker, that's why.
Please email dateameperu at yahoo dot com.

Forrest

Very interesting article. I'm going to have to do some more research into this. Google says there's no duplicate content penalty, but they also say landing in the supplementals isn't a bad thing. I would hate for that to happen purely because two bots crawl the same page and naturally see the same thing…!

Kirk

Hi Jaimie, Dan Thies pointed me in your direction, as I have several websites that all run on a Windows server and primarily use HTM, with some pages being ASP. However, one of my websites is built entirely in ASP.

Dan mentions in his blog that you may have been working on a fix for those of us dirty enough to use ASP and Windows servers. I just wanted to see if this was still the case and if you have any updates on this yet.

Thanks for your efforts so far. As someone who isn't all that savvy in web development and manages to just get by, this is a real help.

Kirk

erwin

Hi All,

My site URL has been hacked! Normally my site shows up when you type "sms4niets" in Google search. Now if you type "sms4niets" in Google, it redirects to a completely different site (not mine!). I'm not very good at PHP, but could somebody help me implement a solution? Please, I need some help on this subject. Thanks!

Erwin dus (@) casema.nl

Chris

I'm with Kirk. We have a mix of ASP and HTML and have been fighting this and scrapers for the last couple of years. We really need help. Please keep me posted on your IIS solution.

Proxy Hi.Jack

I wrote a small article, with a short explanation and some scripts, that might come in handy when fighting this.

Dedicated to the PHP coders!

Phoenix

I've read Dan's post too and wondered how he implemented his proxy-hack-solution. Thanks Jaimie for sharing yours :)

Friday Tea Time - 8/17/07 » TheMadHat

[...] it's easy to do so you certainly need to check it out. SEO Egghead has the solution on how to defend against proxy hacking which is heavy into PHP but doesn't look too hard to implement. It's sad the SEO [...]

Google Proxy Issue - Any Third Party Can De-Index you! | Reviewer of Sites

[...] Implementation Guide is available on Jaimie Sirovich's blog to walk you through some possible preventative [...]

HassleFreeWebSites.com » Blog Archive » Google Backlinks Update in Progress

[...] Google Proxy Hacking: This is the first time I've heard of Google Proxy Hacking so I thought I would post some info: What is it? It's a method of using Google's "remove duplicate content" feature to get sites removed from Google's index or penalized. Find out more and steps to prevent it. [...]

Pushing WordPress SEO Boundaries | Andy Beard - Niche Marketing

[...] is also an SEO worth reading (though he has been taking some R&R due to illness) and wrote the code to fix the proxy [...]

Internet Marketing Campus » Archive » Can Proxy Hacking Remove Your Site From ?

[...] How To Guide: Prevent Google Proxy Hacking [...]

» Anyone Can Remove Your Website from Search Engine Results | SERPS.CN: Sharing SEO Experience, Resources, and Tips

[...] to get the code: An implementation guide is provided on Jaimie's blog, along with a testing environment that you can use to check [...]



