This is an old topic for me, and one that I feel is largely ignored by the SEM community. I asked Jake Baillie of TrueLocal about it at an SES conference a while back, and he suggested that breadcrumbs are trouble when it comes to duplicate content, no matter what. I also used to work for Barry Schwartz of RustyBrick, who is a big believer in breadcrumbs but believes that the search engines should just "deal with it." He may have changed his mind since, so take that with a grain of salt, and don't quote me.
I tend to agree with Barry, but I'm not sure the search engines do. On at least one of the sites I worked with, we deliberately varied the content on each instance of a product in each category. It's one of those cases where legitimate sites have to resort to paranoid anti-spam tactics because of how aggressively spam-associated behavior is being targeted (duplicate content penalties being one example).
So here is a summary of the ways we can address the duplicate content issue, if we do wish to address it:
a) Using primary categories and "robots.txt" or meta-exclusion.
This is the idea espoused by Dan Thies in his SEM book as well. The upside is that it's bullet-proof. You will never be penalized by a search engine for having duplicated pages. But there are 2 downsides.
1) Very often the keywords from your categories (in the title, perhaps under the breadcrumb, or in the "suggested" products) may yield unexpected rankings for what I call "permutation" keywords. Obviously, with this solution, you only get one of the permutations -- the primary one.
2) Users may passively "penalize" you by linking to the non-primary page. A link to an excluded page has questionable link-value: arguably none for that page, but perhaps some for the domain in general. Anyone care to comment on this?
b) Changing up the content on the various permutations.
This is the solution I alluded to above. Done right, this can also work. Remember, spamming is evil :).
c) Using HTTP_REFERER (They spelled it wrong, not me).
This is a new idea of mine. Basically, it involves using the referrer and parsing it to figure out the category the user navigated from. The upside is that there is only 1 URL, despite the fact that you do get (mostly) functional breadcrumbs. Users will always link to that URL, so there's no issue in that regard, either. The downside is that users won't always be navigating from the category pages, and in that case, you must resort to a "primary" category instead. This is not so bad. I'm currently experimenting with this idea. It's not perfect, though. And I'm not sure if it could be detected as a low-grade sort of "cloaking."
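To make the referrer idea concrete, here is a rough sketch in Python. It is not the code I run anywhere. The host name, the product-to-category table, and the function name `breadcrumb_category` are all made up for illustration; in a real application the referrer string would come from the request's Referer header (exposed as HTTP_REFERER in CGI-style environments).

```python
from urllib.parse import urlparse

# Assumption: our own host name. Referrers from anywhere else are ignored.
SITE_HOST = "www.example.com"

# Hypothetical lookup table: each product lists its categories,
# with the "primary" category first.
PRODUCT_CATEGORIES = {
    "Protopic-P38": ["Pharmaceutical-Injury-C1", "In-the-News-C20"],
}


def breadcrumb_category(product, referer):
    """Pick the category to show in the breadcrumb of a single-URL product page."""
    categories = PRODUCT_CATEGORIES[product]
    primary = categories[0]

    # No referrer at all (bookmark, typed URL, stripped header): use the primary.
    if not referer:
        return primary

    parsed = urlparse(referer)

    # The visitor came from another site (e.g. a search engine): use the primary.
    if parsed.hostname != SITE_HOST:
        return primary

    # Look for a path segment that names one of this product's categories.
    for segment in parsed.path.split("/"):
        if segment in categories:
            return segment

    # Came from our site, but not from one of the category pages.
    return primary
```

The fallback branches are the "resort to a primary category" case described above: anything that is not clearly a visit from one of the product's own category pages gets the primary breadcrumb.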
OK, so I'm experimenting with "C," but which method do I use most of the time? "A." It's the method I use on Lawyerseek. Take Protopic, one of the many drugs on the site. It has 2 URLs, but one is excluded in the robots.txt file:
http://www.lawyerseek.com/Practice/In-the-News-C20/Protopic-P38/
http://www.lawyerseek.com/Practice/Pharmaceutical-Injury-C1/Protopic-P38/
The former is excluded. I feel this is the safest method.
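If you want to sanity-check an exclusion like this, Python's standard `urllib.robotparser` module can evaluate a rule against both URLs. The robots.txt content below is an assumption written for this example, not a copy of the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /Practice/In-the-News-C20/Protopic-P38/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

excluded_url = "http://www.lawyerseek.com/Practice/In-the-News-C20/Protopic-P38/"
primary_url = "http://www.lawyerseek.com/Practice/Pharmaceutical-Injury-C1/Protopic-P38/"

# can_fetch() returns True when the given user agent may crawl the URL.
excluded_allowed = parser.can_fetch("*", excluded_url)
primary_allowed = parser.can_fetch("*", primary_url)

print("non-primary crawlable:", excluded_allowed)
print("primary crawlable:", primary_allowed)
```

With a rule like the one assumed here, only the primary URL stays crawlable, which is the whole point of method "A."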











