Some people may know about this already, but it's worth discussing since it has been pertinent to me a few times:
In theory, according to my interpretation of the robots.txt specification, if a Disallow: exists under User-agent: "*" as well as under a specific robot's User-agent:, and that robot accesses the web site, both sets of rules should be applied, and everything they list should be excluded. However, Google does not interpret it this way; it applies only the rules under the specific robot's User-agent:, "googlebot." For Google, it is necessary to repeat all of the rules from "*" under googlebot's User-agent: as well to get this behavior. This may actually be more consistent with the specification (http://www.robotstxt.org/wc/norobots.html), but I'm not sure it's intuitive regardless.
It really all depends on how you read the specification, but my interpretation may be wrong here based on the following quote from the aforementioned URL. Thanks to hjp@hjp.at who pointed this out to me in his comment on this post.
"If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the /robots.txt file."
I guess if you think of it like a C switch statement, it depends on whether each set of rules has an implicit "break" at the end. The quote above seems to indicate as much. In C, the default case is conventionally written last, although the language does not actually require it; robots.txt does not require the "*" record to come last either. Not that the analogy is 100% kosher anyway, but I personally find the whole issue confusing. Furthermore, it appears that MSN and Yahoo interpret it the other way.
Initially, I read the User-agent: specifier like a glob/regex match, assumed "*" meant "everyone," and assumed that every record whose User-agent: pattern matched the robot would be applied. This is also a pretty logical interpretation.
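To make the two readings concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the RECORDS dictionary, the two function names, and the paths /x, /y and /z are my own stand-ins, and the substring match on the agent name is a simplification of what a real crawler does.

```python
from typing import Dict, List

# Hypothetical in-memory form of a robots.txt file: agent name -> Disallow paths.
# The paths /x, /y and /z are placeholders, not real URLs.
RECORDS: Dict[str, List[str]] = {
    "*": ["/x", "/y"],
    "googlebot": ["/z"],
}


def disallowed_specific_only(records: Dict[str, List[str]], agent: str) -> List[str]:
    """Spec-style reading: a matching named record replaces the '*' record."""
    agent = agent.lower()
    for name, paths in records.items():
        if name != "*" and name in agent:
            # Implicit "break": the default record is never consulted.
            return list(paths)
    # No named record matched, so fall back to the default record.
    return list(records.get("*", []))


def disallowed_merged(records: Dict[str, List[str]], agent: str) -> List[str]:
    """Glob-style reading: every matching record applies, '*' included."""
    agent = agent.lower()
    result: List[str] = []
    for name, paths in records.items():
        if name == "*" or name in agent:
            result.extend(paths)
    return result


if __name__ == "__main__":
    print(disallowed_specific_only(RECORDS, "Googlebot"))
    print(disallowed_merged(RECORDS, "Googlebot"))
```

Under the first reading, Googlebot is only kept out of /z; under the second, it is kept out of all three paths. A robot with no record of its own gets the "*" rules either way.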
Thus (at least for Google), if you want X, Y, and Z to be excluded for Googlebot, and not just Z, this:
User-agent: *
Disallow: X
Disallow: Y
User-agent: Googlebot
Disallow: Z
Should be changed to:
User-agent: *
Disallow: X
Disallow: Y
User-agent: Googlebot
Disallow: X
Disallow: Y
Disallow: Z
Otherwise, Googlebot ignores the first two rules (X and Y). In the worst case, not knowing this can wreak havoc with duplicate content.
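One way to sanity-check a file before deploying it is to run both versions through a parser. The sketch below uses Python's standard-library urllib.robotparser, which, as far as I can tell, also uses only the matching named record and falls back to "*" otherwise. It is not Google's parser, so treat the result as an approximation. The paths /x, /y and /z stand in for X, Y and Z, and is_blocked is a helper name I made up.

```python
from urllib.robotparser import RobotFileParser

# The two files from above, with placeholder paths standing in for X, Y and Z.
ORIGINAL = """\
User-agent: *
Disallow: /x
Disallow: /y

User-agent: Googlebot
Disallow: /z
"""

FIXED = """\
User-agent: *
Disallow: /x
Disallow: /y

User-agent: Googlebot
Disallow: /x
Disallow: /y
Disallow: /z
"""


def is_blocked(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if this parser says `agent` may not fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/x", "/y", "/z"):
        print(path,
              "original:", is_blocked(ORIGINAL, "Googlebot", path),
              "fixed:", is_blocked(FIXED, "Googlebot", path))
```

With this parser, the original file should block Googlebot only from /z, while the fixed file blocks all three paths. Other robots fall through to the "*" record in both files.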
If it turns out to be the other way for MSN and Yahoo (and I think it is), it makes the issue even more confusing. Anyone care to comment?

July 21st, 2006 at 5:34 pm
See http://www.robotstxt.org/wc/norobots.html:
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
... any of the other records ...: So Googlebot conforms to the specification.
July 21st, 2006 at 8:25 pm
[...] I decided that I would test what I think is an inconsistency in the interpretation of the robots.txt specification by various implementors cited here. [...]