This is an old topic for me, and one that I feel is largely ignored by the SEM community. I asked Jake Baillie of TrueLocal about it at an SES conference a while back, and he suggested that breadcrumbs are trouble when it comes to duplicate content, no matter what. I also used to work for Barry Schwartz of RustyBrick, who is a big believer in breadcrumbs but thinks the search engines should just "deal with it."  He may have changed his mind since, so take it with a grain of salt.  Don't quote me.

I tend to agree with Barry, but I'm not sure the search engines do.  And I know that on at least one of the sites I worked with, we deliberately varied the content on each instance of a product in each category.  It's one of those cases where legitimate sites have to adopt paranoid, spam-avoidance tactics because of how aggressively spam-associated behavior is targeted, duplicate content penalties being one example.

So here is a summary of the ways we can address the duplicate content issue, if we do wish to address it:

a) Using primary categories and "robots.txt" or meta-exclusion.
     This is the idea espoused by Dan Thies in his SEM book as well.  The upside is that it's bullet-proof: you will never be penalized by a search engine for having duplicated pages.  But there are two downsides.
        1) Very often the keywords from your categories (in the title, perhaps under the breadcrumb, or in the "suggested" products) yield unexpected rankings for what I call "permutation" keywords.  Obviously, with this solution you only get one of the permutations: the primary one.
        2) Users may passively "penalize" you by linking to the non-primary page.  A link to an excluded page has questionable link value: arguably none for that page itself, though perhaps some for the domain in general.  Anyone care to comment on this?

 b) Changing up the content on the various permutations.
    This is the approach I alluded to above.  Done right, this can also work.  Remember, spamming is evil :).

 c) Using HTTP_REFERER (they spelled it wrong, not me).
    This is a new idea of mine.  Basically, it involves parsing the referrer to figure out which category the user navigated from.  The upside is that there is only one URL, yet you still get (mostly) functional breadcrumbs, and since users will always link to that single URL, there's no issue in that regard either.  The downside is that users won't always be navigating from a category page, in which case you must fall back to a "primary" category instead, which is not so bad.  I'm currently experimenting with this idea (a rough sketch follows below).  It's not perfect, though, and I'm not sure whether it could be detected as a low-grade sort of "cloaking."
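
For anyone who wants to see the referrer idea concretely, here is a rough sketch in Python.  Treat it as an illustration only: the category labels, the helper name, and the primary-category fallback are my own assumptions, and a real implementation on your own platform would obviously look different.

    from typing import Optional
    from urllib.parse import urlparse

    # Hypothetical mapping of category slugs (as they appear in URL paths)
    # to breadcrumb labels; your CMS would supply something like this.
    CATEGORIES = {
        "Pharmaceutical-Injury-C1": "Pharmaceutical Injury",
        "In-the-News-C20": "In the News",
    }

    PRIMARY_SLUG = "Pharmaceutical-Injury-C1"   # fallback when the referrer tells us nothing
    OUR_HOST = "www.lawyerseek.com"

    def breadcrumb_category(referer: Optional[str]) -> str:
        """Pick the breadcrumb category for a product page from the HTTP referrer.

        If the visitor navigated here from one of our own category pages, use
        that category; otherwise (external referrer, bookmark, search engine,
        or no referrer at all) fall back to the primary category.
        """
        if referer:
            parsed = urlparse(referer)
            if parsed.netloc == OUR_HOST:
                for slug, label in CATEGORIES.items():
                    if slug in parsed.path:
                        return label
        return CATEGORIES[PRIMARY_SLUG]

    # Visitor who navigated from the "In the News" category page:
    print(breadcrumb_category("http://www.lawyerseek.com/Practice/In-the-News-C20/"))  # In the News
    # Visitor arriving from a search engine gets the primary category:
    print(breadcrumb_category("http://www.google.com/search?q=protopic"))              # Pharmaceutical Injury

The key point is that the product itself lives at a single URL; only the breadcrumb rendered on that page varies with the referrer.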

OK, so I'm experimenting with "c," but which method do I use most of the time?  "a."  It's the method I use on Lawyerseek.  Look closely at the two links for Protopic, one of the many drugs covered on the site:

There are two URLs, but one is excluded in the robots.txt file:

http://www.lawyerseek.com/Practice/In-the-News-C20/Protopic-P38/
http://www.lawyerseek.com/Practice/Pharmaceutical-Injury-C1/Protopic-P38/

The former is excluded.  I feel this is the safest method.
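
For completeness, the exclusion itself is just a Disallow rule.  I'm not reproducing the actual Lawyerseek robots.txt here; the snippet below is only an illustration of the pattern, and the real file may use broader or different rules.  The meta-exclusion alternative from option "a" would instead put a <meta name="robots" content="noindex"> tag in the head of the non-primary page.

    # Illustrative robots.txt rule only; the real file may differ.
    User-agent: *
    Disallow: /Practice/In-the-News-C20/Protopic-P38/
    # Or, to exclude every duplicated product page under that category:
    # Disallow: /Practice/In-the-News-C20/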
