We finally have a conclusion on exactly how to interpret a robots.txt file for the edge cases mentioned here.  Someone started a WebmasterWorld thread on the point of contention.

Indeed, according to the specification, the rules for a specific matching user agent entirely override the "User-agent: *" rules.  Therefore, any rule under "User-agent: *" that should also apply to a specific bot must be repeated under that bot's own "User-agent:" group.  In other words, the more specific set of directives takes precedence over the default, and only one set is applied.
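
For illustration, here is a minimal sketch of a robots.txt written under that interpretation (the paths are made up).  Because Googlebot reads only its own group, the shared "Disallow:" lines have to be repeated there; without them, Googlebot would be blocked from nothing but /beta/:

    User-agent: *
    Disallow: /private/
    Disallow: /tmp/

    User-agent: Googlebot
    Disallow: /private/
    Disallow: /tmp/
    Disallow: /beta/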

GoogleGuy says in the thread that he "… believes most/all search engines interpret robots.txt this way …"  This is also consistent with my testing.

However, some recommend placing the "User-agent: *" group last just in case, because a bot that ignores this part of the specification may simply take the first group that matches, even if that group is the catch-all "*".  Ordering the file with the specific groups first achieves the intended result either way.
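
To hedge against such bots, the earlier sketch can simply be reordered so the specific group comes first and the catch-all comes last (again, hypothetical paths).  A spec-compliant crawler picks its own group no matter where it appears in the file, and a first-match crawler now lands on the right group as well:

    User-agent: Googlebot
    Disallow: /private/
    Disallow: /tmp/
    Disallow: /beta/

    User-agent: *
    Disallow: /private/
    Disallow: /tmp/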
