Some people may know about this already, but it's worth discussing since it has been pertinent to me a few times:

In theory, according to my reading of the robots.txt specification, if a Disallow: exists under User-agent: "*" as well as under a specific robot's User-agent:, and that robot accesses the web site, both sets of rules should apply and both paths should be excluded.  Google does not interpret it this way, however: it applies only the rules under the specific "googlebot" User-agent:.  To get both sets of exclusions from Google, you have to repeat all of the "*" rules under Googlebot's User-agent: as well.  This may actually be more consistent with the specification (http://www.robotstxt.org/wc/norobots.html), but I'm not sure it's intuitive regardless.

It really all depends on how you read the specification, but based on the following quote from the URL above, my interpretation may well be wrong.  Thanks to hjp@hjp.at, who pointed this out in a comment on this post.

"If the value is '*,' the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the /robots.txt file."

I guess if you think of it like a C switch statement, it depends on whether each set of rules has an implicit "break" at the end, and the quote above seems to indicate as much.  In C, the default case is conventionally written last (though the language doesn't actually require it); robots.txt doesn't require the "*" record to be last either.  The analogy isn't 100% kosher anyway, but I personally find the whole issue confusing.  Furthermore, it appears that MSN and Yahoo interpret it the other way.
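
To make the two readings concrete, here is a toy sketch in Python (not any real parser; the record layout and names are made up just for illustration).  The first function models the "implicit break" reading, where a robot gets exactly one record; the second models the merging reading, where every matching record applies:

# Toy model of the two readings. "records" maps a User-agent value to its
# Disallow paths; the layout and the /X, /Y, /Z paths are invented.
records = {
    "*": ["/X", "/Y"],
    "Googlebot": ["/Z"],
}

def rules_implicit_break(agent, records):
    """Google-style reading: use the specific record if one matches,
    otherwise fall back to "*" -- never both."""
    for ua, disallows in records.items():
        if ua != "*" and ua.lower() in agent.lower():
            return disallows              # the implicit "break"
    return records.get("*", [])

def rules_merged(agent, records):
    """The other reading: every matching record applies, and "*" matches everyone."""
    merged = []
    for ua, disallows in records.items():
        if ua == "*" or ua.lower() in agent.lower():
            merged += disallows
    return merged

print(rules_implicit_break("Googlebot", records))  # ['/Z']
print(rules_merged("Googlebot", records))          # ['/X', '/Y', '/Z']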

Initially, I read the User-agent: specifier like a glob/regex match: "*" meant "everyone," and every record whose User-agent: pattern matched would apply.  That's also a pretty logical interpretation.

Thus (at least for Google), if you want X, Y, and Z to be excluded — not just Z:

User-agent: *
Disallow: X
Disallow: Y
User-agent: Googlebot
Disallow: Z

Should be changed to:

User-agent: *
Disallow: X
Disallow: Y
User-agent: Googlebot
Disallow: X
Disallow: Y
Disallow: Z

Otherwise, Google would ignore the first two rules.  Not knowing this can, in the worst case, wreak havoc with duplicate content issues.
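
Incidentally, if you want to poke at how a real parser treats such a file, Python's standard-library urllib.robotparser is an easy way to experiment; as far as I can tell it also applies only the most specific matching record, falling back to "*" only when nothing else matches.  A quick sketch, using the "before" file from above with placeholder /X, /Y, /Z paths:

from urllib.robotparser import RobotFileParser

lines = """\
User-agent: *
Disallow: /X
Disallow: /Y

User-agent: Googlebot
Disallow: /Z
""".splitlines()

rp = RobotFileParser()
rp.parse(lines)

# Googlebot matches only its own record, so the "*" rules don't apply to it.
print(rp.can_fetch("Googlebot", "/X"))      # True -- not excluded!
print(rp.can_fetch("Googlebot", "/Z"))      # False
# A robot with no record of its own falls back to "*".
print(rp.can_fetch("SomeOtherBot", "/X"))   # False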

If MSN and Yahoo really do interpret it the other way (and I think they do), that makes the issue even more confusing.  Anyone care to comment?
