SEO Egghead by Jaimie Sirovich: A blog about SEO, written for nerds, by a nerd.


Tue, 22 Aug '06

Google Spiders (Very) Simple Forms

I used to assume that content behind forms was never spidered.  That does not seem to be the case, as one particular form on this blog showed me.

It appears that if Google sees a form consisting only of one pulldown (select), it will spider the URLs created by submitting the form with each of the values in the pulldown.  This has a few implications:

1. Google may also spider a form consisting of any control with a finite domain, such as a group of radio buttons.  It could also decide to spider forms with multiple controls having finite domains -- perhaps a select and a radio button selection combined.
2. One can no longer assume that forms are never spidered.  If the URLs created by the form request yield duplicate content, or just stuff you don't want spidered, they should be excluded somehow.  I'm positive that these URLs were not linked from anywhere on this blog.
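
To make the "finite domain" idea concrete, here is a minimal Python sketch of how a crawler could enumerate every URL such a form can produce. It illustrates the idea and is not a description of how Google actually does it. The function name, the example.com action URL, and the "order" radio group are all hypothetical; only the "cat" parameter name comes from this blog.

```python
from itertools import product
from urllib.parse import urlencode


def enumerate_get_form_urls(action, controls):
    """Return every URL a GET form can produce when each control
    (select, radio group, ...) has a finite list of values.

    `controls` maps a control name to its list of possible values.
    """
    names = list(controls)
    urls = []
    # One URL per combination of values across all controls.
    for combo in product(*(controls[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action}?{query}")
    return urls


# Hypothetical form with a single pulldown named "cat".
single = enumerate_get_form_urls(
    "http://www.example.com/", {"cat": ["3", "7", "10"]}
)

# Hypothetical form combining a pulldown with a radio group.
combined = enumerate_get_form_urls(
    "http://www.example.com/",
    {"cat": ["3", "10"], "order": ["asc", "desc"]},
)

for url in single + combined:
    print(url)
```

The number of URLs is the product of the sizes of the controls' domains, so a single pulldown is cheap to enumerate, while forms with several controls grow quickly.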

Notably, the pulldown for "Choose a Topic:" in my template fits these criteria.  The URLs it generated were not the rewritten permalink categories that my installation of WordPress uses.  This created a duplicate page for every category, for example "www.seoegghead.com/?cat=10."  Since it's a form, the browser builds the URL by appending the field name and value as a query string, so there is no way to get it to use the rewritten permalinks -- that's just how a form works.  It's not a deficiency or quirk in WordPress.

The solution was to add the following line to my robots.txt:

Disallow: /?cat=

This lets me keep the pulldown, but prevents Google from spidering anything behind it.
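
For anyone who wants to sanity-check a rule like this, here is a rough Python sketch of the simple prefix matching that robots.txt Disallow values traditionally use. It is a simplification, not a model of any particular crawler: real spiders may normalize or escape URLs differently, and the rule only takes effect when it sits under a User-agent line in robots.txt. The function name is made up, and the "/category/seo/" permalink below is a hypothetical example.

```python
from urllib.parse import urlsplit


def is_disallowed(url, disallow_prefixes):
    """Return True if the URL's path (plus query string, if any)
    starts with one of the Disallow values."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # An empty Disallow value blocks nothing, so skip empty prefixes.
    return any(p and target.startswith(p) for p in disallow_prefixes)


rules = ["/?cat="]

# URL generated by the pulldown: blocked.
print(is_disallowed("http://www.seoegghead.com/?cat=10", rules))
# Hypothetical rewritten permalink: still crawlable.
print(is_disallowed("http://www.seoegghead.com/category/seo/", rules))
# Home page: still crawlable.
print(is_disallowed("http://www.seoegghead.com/", rules))
```

Because the match is a plain prefix, the rule catches every category ID the pulldown can emit without listing them one by one.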

No other search engines appear to do this right now.  And to be honest, I agree with what Google is doing here.  They're seeking to index as much as possible.  I just figured I'd smash the "common knowledge" that forms are never spidered.  They apparently are sometimes.


5 Responses to “Google Spiders (Very) Simple Forms”

  1. Melanie Phung Says:

    Yikes. Do you think this could at some point extend to forms without predetermined selections, but where there is technically speaking a finite number of possibilities? For example, phone number fields or zip code entry boxes.

    I'm assuming no. But I also assumed that the bots didn't index URLs launched as JavaScript pop-ups (which we now know isn't true) and that anything behind a form was off-limits (which, if what you say is true -- and it seems 100% reasonable -- is a false assumption also).

  2. Leslie Hensley Says:

    The "Choose a Topic" form on this site has a method of "GET". That seems reasonable for Google to spider given their history with the Google Web Accelerator. I'm guessing that "POST" forms will never be spidered by Google.

  3. Melanie Phung Says:

    That makes a lot of sense Leslie, thanks. (But maybe, just to be on the safe side, I'll start thinking about robots exclusions for page variations I don't want spidered.)

  4. Jaimie Sirovich Says:

    Actually, she has a good point. According to the RFC, GETs are for GETting data, and POSTs are for mutating data. Even though nobody follows that logic, if Google didn't at least do that, and as a result deleted your whole web site or something, they could have a lawsuit on their hands.

    Not that I haven't heard about people leaving stuff like live (not even form-based) GET delete links on their sites, though ...

  5. Melanie Phung Says:

    Jaimie - are you familiar with this incident?
    http://www.all-about-content.com/2006/03/wtf-googlebot-of-doom.html

    Hilarious, but scary.
