- Aug. 22nd, 2006
I used to assume that content behind forms was never spidered. That does not seem to be the case, as one particular form on this blog showed me.
It appears that if Google sees a form consisting only of a single pulldown (select), it will spider the URLs produced by submitting the form with each of the values in the pulldown. This has a few implications:
1. Google may also spider a form consisting of any control with a finite domain, such as a group of radio buttons. It could also decide to spider forms that combine several finite-domain controls, for example a select together with a group of radio buttons.
2. One can no longer assume that this does not happen. If the URLs created by the form request yield duplicate content, or just stuff you don't want spidered, they should be excluded somehow. I'm positive that these links were not present anywhere on this blog, so Google must have generated them from the form itself.
Notably, the pulldown for "Choose a Topic:" in my template fits these criteria. The URLs it generated were not the rewritten permalink categories that my installation of WordPress uses. This created a duplicate page for every category, for example "www.seoegghead.com/?cat=10". Since it's a form, there is no way to get it to use the rewritten permalinks: submitting a form simply appends the selected value to the URL as a query string. That's just how a form works. It's not a deficiency or quirk in WordPress.
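To make this concrete, here is a minimal Python sketch of how a crawler could enumerate every URL a GET form can produce when all of its controls have finite domains. It is only a guess at the general idea, not a description of what Google actually runs. The function name enumerate_form_urls, the category IDs other than 10, and the "order" control are all made up for illustration.

```python
from itertools import product
from urllib.parse import urlencode


def enumerate_form_urls(action, controls):
    """Return every GET URL a form can produce.

    `controls` maps each control name to its finite list of values,
    e.g. the <option> values of a select or the values of a radio group.
    """
    names = list(controls)
    urls = []
    # One URL per combination of values across all controls.
    for combo in product(*(controls[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action}?{query}")
    return urls


# A form with a single pulldown (hypothetical category IDs).
single = enumerate_form_urls("http://www.seoegghead.com/", {"cat": ["1", "5", "10"]})

# A select combined with a radio group (the "order" control is hypothetical).
combined = enumerate_form_urls(
    "http://www.seoegghead.com/",
    {"cat": ["1", "10"], "order": ["asc", "desc"]},
)

for url in single:
    print(url)
print(len(combined), "URLs from the combined form")
```

The number of URLs is the product of the sizes of each control's domain, so a form with several controls can expand into a lot of crawlable pages very quickly.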
The solution was to add the following line to my robots.txt:
This lets me keep the pulldown, but prevents Google from spidering anything behind it.
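As a sanity check on this kind of fix, the sketch below uses Python's standard urllib.robotparser to test a prefix rule against a form-generated URL. The rule "Disallow: /?cat=" is an assumption about what such a line could look like, not necessarily the exact line used on this blog, and the /category/seo/ permalink is a hypothetical example of a rewritten URL.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: block every URL whose path and query start with "/?cat=".
ROBOTS_TXT = """\
User-agent: *
Disallow: /?cat=
"""


def is_allowed(url, user_agent="Googlebot"):
    """Return True if the assumed robots.txt lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


form_url = "http://www.seoegghead.com/?cat=10"          # generated by the pulldown
permalink = "http://www.seoegghead.com/category/seo/"   # hypothetical rewritten URL

print(is_allowed(form_url))
print(is_allowed(permalink))
```

Under these assumptions the form-generated URL should come back as blocked while the rewritten permalink stays crawlable, which is the behavior wanted here: the pulldown keeps working for visitors, and only the duplicate query-string pages are excluded.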
No other search engines appear to do this right now. And to be honest, I agree with what Google is doing here: they're seeking to index as much as possible. I just figured I'd smash the "common knowledge" that forms are never spidered. Apparently, sometimes they are.