
How You Can Stop Dirty Feed Scrapers In 3 Easy Steps

Posted by Jaimie Sirovich on Jan. 2nd, 2007.


Stealing is wrong, but some people just don't seem to get it when it comes to intellectual property.  Some of my posts take a few hours to write.  It's just plain annoying when people steal my work.  I'm sure that you feel the same way.

Now, normally, I don't call out spammers — but since this fine individual also decided to "syndicate" both Matt Cutts (spam assassin extraordinaire) and SEO Black Hat (spammer extraordinaire), I will document exactly what he is doing, and how to stop it.  (Yes, someone is stupid enough to steal content from Matt! — but he was very careful to nofollow all the links back to the original sources. He must read Matt's blog too!)

All of us who use WordPress automatically generate web feeds.  Feeds provide the same information as our web pages, but in a common XML-based format so that applications such as feed readers can process and aggregate information from various sources.  By default, WordPress provides the full content of the post in its feeds.

Unfortunately, this also gives spammers convenient access to content for their spamming enterprises.  Though it's possible to remove the full content from the feed and provide only excerpts, this makes your feed less useful.  Thankfully, most spammers aren't too bright, and access your feed from the same IP address as the spam web site it is posted on.  So here's how to block them:

1. Get the IP of the web site that is stealing your content.

%ping www.trafficboosterpro.com
PING trafficboosterpro.com (74.52.58.162): 56 data bytes

2. Search your logs for that IP address (via SSH).

%cat www.20061231 | grep "74.52.58.162"
74.52.58.162 - - [31/Dec/2006:01:00:38 -0500] "GET /blog/feed/ HTTP/1.0" 200 49330 "-" "TrafficBoosterPRo (+http://TrafficBoosterPro.com/)"

3. Place the following directives in your .htaccess file.

RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^74\.52\.58\.162$
RewriteRule ^.*$ - [F]

Done!  Now this cheesy spammer selling cheesy black hat products (mind you, they wouldn't even work), can't steal my content anymore.  Good riddance :)

Now, this won't get rid of every spammer; some are persistent — or more sophisticated. But taking some time out to eliminate a few of these guys is worthwhile.
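
Since a single-IP block only catches the scrapers you already know about, it can help to see who is fetching the feed in the first place. What follows is a minimal shell sketch, not part of the three steps above. It assumes an Apache combined-format log like the line shown in step 2, and the function name feed_hits_by_ip and the default path /blog/feed/ are illustrative choices.

```shell
#!/bin/sh
# Count feed requests per client IP in an Apache combined-format access log.
# Assumptions: the client IP is the first whitespace-separated field and the
# request path is the seventh, as in the sample log line from step 2.
# feed_hits_by_ip is a hypothetical helper name, not a standard tool.
feed_hits_by_ip() {
    log_file=$1
    feed_path=${2:-/blog/feed/}
    # Tally one hit per matching request, then list the busiest IPs first.
    awk -v path="$feed_path" '
        $7 == path { hits[$1]++ }
        END { for (ip in hits) print hits[ip], ip }
    ' "$log_file" | sort -rn
}

# Usage:
#   feed_hits_by_ip www.20061231
#   feed_hits_by_ip www.20061231 /blog/feed/ | head
```

Each output line is a hit count followed by an IP address. An address that pulls the full feed far more often than any normal reader would is a reasonable candidate for the ping-and-compare check in step 1.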








"39 Wise Comments Banged Out Somewhere On The Internet ..."


Robert

Alternatively, to spend even less time on unproductive work, add this line to .htaccess:

Deny from 74.52.58.162

Which would even allow you to block a whole range of IP addresses in case it proves necessary…
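
A hedged sketch of how such a block list might be maintained: the shell function below (the name deny_lines is made up for illustration) turns a list of addresses or ranges into Deny from directives. It assumes Apache 2.2-style access control, where a partial address such as 74.52.58 or a CIDR range such as 74.52.58.0/24 covers a whole network. The range in the usage lines is only an example of the syntax, not a known scraper range.

```shell
#!/bin/sh
# Print one "Deny from" directive per address or range given as an argument.
# deny_lines is a hypothetical helper name; blank arguments are skipped.
deny_lines() {
    for addr in "$@"; do
        [ -n "$addr" ] && printf 'Deny from %s\n' "$addr"
    done
    return 0
}

# Usage (appends to the .htaccess file in the current directory):
#   deny_lines 74.52.58.162 >> .htaccess
#   deny_lines 74.52.58.162 74.52.58.0/24 >> .htaccess
```

Blocking a whole range is a blunt instrument, so it is worth checking the logs first to make sure no legitimate readers share it.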

Ross M Karchner

hmmm, there's a lot of other evil fun that could be had with this, for instance redirecting to a feed from one of their competitors, or a special feed with a single or repeating "please don't steal" message.

Wayne Smallman

I had someone in China kindly scour my entire 'blog over the course of yesterday.

498 page views, which isn't bad to say I only have 133 articles in there…

Jonny T

What a joke this spammer is! If you check out their homepage, they promote "Doorway Cloaking Scripts"…"designed to drive massive traffic to any web site". What a crock!

Jeremy Luebke

Most of these guys are using a site that provides feeds also. So if you want to be really mean, you could redirect them to eat their own feed creating a loop ;)

Gerry Grant

Can you modify this to stop ongoing click fraud from the same IP address? Or, following Ross's idea, have them go to their own paid ads? I will be thinking of this for a while.

2007 the year of Search Optimization 2.0

What does that mean to you? I am taking a survey.

TBP

Big Deal out of Nothing….

Check this out: http://trafficboosterpro.com/news/index.php?what=all&how=paged&_tbredir=1

It's the January 2nd 2007 updates. So your .htaccess doesn't work (not for TrafficBoosterPro).

So why you don't change the article title to:

How You Can NOT Stop Dirty Feed Scrapers In 3 Easy Steps.

You have to do better than that. Besides you have to know that in all postings there is a source link (back to your site) . So this is NOT stealing. If you use a Newspaper article in your Blog and you mention the source is not stealing. 99% of Blogs are doing so everyday.

Any way instead of saying Bul****t here why you don't just email the site admin to remove your RSS feed from there?

It's simple and not philosophy. The web site you mentioned has a contact form. Why you don't use it? It seams that you like to get some clicks from there, right?

It seams to me that when God was raining minds …some people was using their umbrellas… ;-))

Rob

If you want to stop ongoing click fraud from the same address, try this little doozy. It will not display ads to any of the IPs you place in the script.

Jaimie Sirovich

TBP,

I didn't actually apply these changes yet :) I wanted to see if you were doing it semi-manually or automatically. I thought it would be pretty funny if you posted this post on your scraper blog.

No hard feelings.

J.

phaithful

Wow, TBP for a splogger he sure is defensive. If you're going to succeed at ripping off content… you've got to have thicker skin than that.

I can't believe Jimmy Saunders is promoting cloaking and doorway generator software while syndicating content from Matt Cutts. Aren't you mixing the marketing message there? Don't you have something better to do with your life than waste it away promoting dated black hat software that won't ever work?

Ryan

grep "stuff to search" file.log

Mark Brandon

Yeah, but do you want to block these scrapers? Why syndicate at all if you don't want people to consume your content?

Not to defend the splogger above, but if he links back, then you benefit from link popularity. Google recognizes duplicate content, so the splog site should not ever outrank you.

I've had some tools which republish bloggers' headlines (with their permission). The link popularity has benefited the SEO enormously.

Jaimie Sirovich

Mark,

Consuming is one thing. Lifting is another. Just because we provide XML feeds does not mean the content is free. It's under copyright implicitly unless we state otherwise.

Furthermore he nofollowed those links (I didn't mention that). Antisocial behavior like that has to stop. It's just plain theft.

Jaimie Sirovich

"Besides you have to know that in all postings there is a source link (back to your site) . So this is NOT stealing. If you use a Newspaper article in your Blog and you mention the source is not stealing. 99% of Blogs are doing so everyday."

– No; it's legal to excerpt material for "fair use." It is never legal to take an article in its entirety without permission.

"It seams to me that when God was raining minds … some people was using their umbrellas … ;-) )"

– 2 spelling/grammar errors while attacking my intelligence? I have to laugh.

Anyway, it seems that people have a major misconception of what fair use is. I'm not a lawyer — and I can't tell you what it is, but I can tell you what it isn't.

Lifting entire articles is not fair use!

Stace

Jamie, don't throw rocks from the glass house — I don't think you'll find riddens in any dictionary.

Does your book have an editor? Hope so.

(Note: Spelling error fixed. Thanks for pointing it out.)

Steve

Not to harp, but "riddens" is spelled "riddance".

As in, good riddance to bad spelling.

http://www.urbandictionary.com/define.php?term=good+riddance

Ajay

Robert's solution is the easiest. I've been adding the deny from for a lot of IP addresses, both scrapers and leechers.

John Doe

Bah, just because it is not in the dictionary does not mean riddens is not a word. Slang is part of modern english and that you can identify its root, supports covernsion of the word.

Mark, you are comparing apple and oranges. Stealing is stealing, go away please.

Andrea Micheloni

Well.. thanks, what about making a plugin for that? ;)

rush

there are easy workarounds for your solution. it's like trying to stop software piracy, it will never happen so why even try waste your time and energy?

Jaimie Sirovich

Perhaps, but Microsoft has been very successful with Windows activation. It won't stop everyone, but it might make them move on to an easier target.

And, seriously, why swipe content from Matt Cutts? That's the reason I singled him out.

Jonathan

There's one problem with this approach though: what if you're using FeedBurner? A lot of people use FeedBurner to manage their feed so this approach would not work.

I'm still looking for a foolproof method to get around this problem of scraping while still providing full-content feeds…

Matt Sandy

To be honest, what I think you should do is after it pulls the feed have part of the content include a comment tag with the IP of who they are, so when they repost, you know if they are even doing it from the same server as their site. THEN make them publish some tub girl or something as opposed to actual content.

oral seymour

thanks for the advice. I'm having the same problem with some of my clients. I think that's why his entire site ends up in the supplemental index.

TBP

Wow,

I didn't think that this could go this far. First of all, Happy New Year to everyone!

Let me tell you one thing. When I select feeds for a blog, it's because they are worth it.

Now, by reading some of the posts here I see that only "Mark Brandon" has got the real point.

I won't let you know about the circulation that has my RSS feed and how many people are using it on their web sites.

Perhaps you know about SEO but it doesn't seem that you know about RSS that much. (I don't say that to hurt your feelings; perhaps you do know what RSS is and how fast several sites can start using your feed in their pages).

All I'm saying is that the same posts from my site are printed also on thousand other sites (sites that do not belong to me). You see I re-publish my blog using Feed burner…

So people who happen to find these posts on all those sites end up at the source. AND THAT'S YOUR SITE.

So Jaimie make a formal request to the web site's contact form to remove your content and this will be done within 24 hours of your request. You don't need to write .htaccess files and try to figure out my methods or mark my software.

Cause Jaimie you made some replies here but you still do not say anywhere that you want your RSS feed to be removed.

By reading this:
———————-
" I didn’t actually apply these changes yet :) I wanted to see if you were doing it semi-manually or automatically. I thought it would be pretty funny if you posted this post on your scraper blog.

No hard feelings."

And by reading this:
————————
"Done! Now this cheesy spammer selling cheesy black hat products (mind you, they wouldn't even work), can't steal my content anymore. Good riddance"

Makes me Wonder which one is true… You tried to block me and you failed? You Trying to be smart on programming skills that didn't work? Or maybe you just wanted to point out that my software doesn't work? ;-)

Either way I don't really care for opinions of people that are NOT my customers. :-) How can you judge something that you haven't even tried? (maybe you will say… because it's a black hat tool. Then just say that. And not that the software doesn't work, because you are not one of my customers, and since I do not give my software away for free you haven't tried it and for sure you can't express such an opinion).

Now about "phaithful" he says:
—————————-
"Wow, TBP for a splogger he sure is defensive. If you’re going to succeed at ripping off content… you’ve got to have thicker skin than that."

I think that the above (and below) answers to Jaimie does answer to your posting also phaithful.

And about that:
"I can’t believe Jimmy Saunders is promoting cloaking and doorway generator software while syndicating content from Matt Cutts. Aren’t you mixing the marketing message there? Don’t you have something better to do with your life than waste it away promoting dated black hat software that won’t ever work?"

How can you make such an assumption(that software don't work) my friend. You are not even a customer of mine…

As I said I only use RSS feeds from web sites that I respect and because they worth it to be published. (and re-published). Now about the part of what I do with my life… as you said it's my life and not yours. So let's not start judging each other's life here. Who are you to judge me? I'm not suggesting to you to change life style or purpose to life. If you are happy with your life… well I'm happy with mine too. So thanks but I will pass… Insult passed as if unnoticed…

Matt is a person that I respect and I'm sure that he is aware of my site. I could publish only Black Hat posts but I think that people, and especially my customers, whom I also respect very much, have the right to know all sides and choose.

They can choose between a software that does things automatically and might work or not work (there is a 30 days Money back guarantee for that) and to know all the risks (by reading Matt's content) and they also can choose to be scammed by "White" Hat SEO's wanna be, that sell to them High rankings and traffic and Links from link farms.

I don't say that to offend here anybody (in this blog) but I'm sure that you are all aware of people that does that and sell such SEO services for a few (or more) bucks. At least I sell a software that works (and there is not even one negative complain from a "REAL" customer that says otherwise for the last almost 2 years that I sell this software online). And there are customers that use it in a "let's say nice way" and there are customers that do spam.

The knifes can be used to cook or can be misused. Ex. for a kill. The guns also and so on…

Doorway and cloaking software is not welcome by many because it automates and does things in seconds that a team of 100 SEO's can do manually in many hours of hard work.

My software (especially the new version) can optimize a page in a blink of an eye. And it can do it in many different languages including japanese, Korean, etc. (about 20 languages). It is also an income tool cause it shows Adsense, Amazon, Clickbank, Ebay auction products that match the main keyword or keyword phrase on each page. A user of my software doesn't only buy it to redirect his traffic (which is optional) but also to sell advertisement on their pages or promote affiliate products, without redirecting at all.

Things that can never be done by one person or even a team by hand. My software can create a complete web site from scratch, in seconds. All pages 100% optimized.

And believe me it takes good SEO and marketing knowledge to write a tool that automates SEO and marketing work. (even if it's black and white and grey and other colors of SEO used in this software). I'm sure that many of you has other opinion about Doorway and cloaking software. I respect every one's opinion. But this is not the issue here.

I don't sell Black Hat SEO services to my customers and most of all I don't lie to them. I sell a software that does what it says. It is a 100% LEGAL software.

Don't take this the wrong way. I don't want to say that I'm the best in what I do. Maybe I'm and maybe I'm not.

I let the others judge my work (those who are familiar with it, ex. my customers and not "thin skin" people who just post insults just to take a link back to their sites from this blog and play smart giving directions in life…) and time will tell…

Any way the one thing that must come out of this post is that I use RSS feeds from sites that knows what they write about and I mention the source and do not use this RSS feeds in an external splog to drive traffic to my site but I use it inside my own web site even if this gives the picture of splog.

From my side of view I do not steal an RSS feed that is publicly given. I give a link to the source so thousand other people can visit the source (which benefits the source site) and "yes"… I use nofollow in the links. But this has to do with pagerank and spiders.

Each RSS feed may have hundreds of links back to the source web site. It is something that can't be controlled and you all know that.

I don't think that anybody here needs SEO 101 lessons. But you all know what can happen to a web site with 50 or 100 or more links in one page. At least it will be "marked" as a link farm web site. And this is the reason for using the nofollow to the source web sites.

One more thing I would like to post here. In a couple of months (maybe sooner) I will start a SEO contest. I would like to see many of you there.

When the time comes, I will (if I'm still welcome) post here the details.

I will have the opportunity to gain SEO knowledge from people and their favorite SEO tools or work.

Also I will have the opportunity to see for myself that you are right, that my software doesn't really work as Jaimie and phaithful say and strongly support. I don't think that you guys wouldn't like to be the first to join this contest to prove your point? Right? (that my software doesn't work?)…

No hard feelings Jaimie.

It will be a black and white and grey and pink and many other color SEO contest, open for everybody to join.

And Jaimie thanks for posting my answers here. Perhaps I do have some misspelling but I was in a hurry (I'm in the middle of updating my web site for the new version of my software and I can't join this warm company here for a while until I finish the updates).

Hope to see everybody in the SEO contest though.

Jaimie Sirovich

My primary point is that RSS feeds do not make content public domain. Copyright law does not magically change when you publish something in XML. The burden of opting out of your syndication is not mine. It's illegal to do it in the first place.

You are breaking the law, plain and simple.

TBP

And about this:

"I thought it would be pretty funny if you posted this post on your scraper blog.

No hard feelings.

J."

It's there. Go to http://trafficboosterpro.com/blog/how-you-can-not-stop-dirty-feed-scrapers-in-3-easy-steps.html

Regards

TBP

Jaimie Says:

"My primary point is that RSS feeds do not make content public domain. Copyright law does not magically change when you publish something in XML. The burden of opting out of your syndication is not mine. It’s illegal to do it in the first place.

You are breaking the law, plain and simple."

Your Feed is gone from my site. It will not be syndicated in the future again.

But since you mention illegal activities you should know that defamation of a product or service or website is also illegal, and therefore another negative comment about my software from you or any of your posters (that are not my customers) will not be left unnoticed cause:

You are also breaking the law, plain and simple. ;-)

Regards

Andy Beard

Adding nofollow isn't just to stop alarm bells, it is to circumvent duplicate content and original source algorithms.

You can't block anyone who really wants to splog your content; all they have to do is add it to Google Reader, automatically add a tag to the feed, and then reblog / splog it.

Until Google Reader has an option for preventing sharing, such as a "noshare" tag for feeds, it will remain one of the most powerful splogging tools.

oepiru

TBP is such a naughty stealer, and does not even apologize or at least shut the f. up

alexf2000

Nice post, you got diggularity level 34 for it. :)

Weblog Tools Collection » Blog Archive » How You Can Stop Dirty Feed Scrapers In 3 Easy Steps

[...] How You Can Stop Dirty Feed Scrapers In 3 Easy Steps takes you through three steps to identifying feed scrapers and blocking them using .htaccess. An easier method the author misses, pointed out in the first comment, is just using deny from instead of the RewriteRule. [...]

Wie man Contentdiebe in drei Schritten sperren kann | bueltge.de [by:ltge.de]

[...] … is what Jaimie Sirovich describes on his blog in a few sentences. The three steps are quick to follow, and with them you can lock out the potential thieves. [...]

onmeco Blog » Newsfeed Diebe aussperren

[...] Thanks to Web 2.0, news feeds are very much in vogue. Unfortunately, there are many (lazy) webmasters who obtain content for their own web sites this way. As you know, we use WordPress, where the news feed (an XML file) is generated automatically. Malicious webmasters use exactly this XML file to get at other people's content. There is a plugin for WordPress that can automatically add a copyright notice. Users of this plugin can specify their URL as well as a custom copyright notice, which is transmitted along with the feed. At this point I would like to point to Jaimie Sirovich's tricks. Sirovich has explained nicely on his blog how to "stop" news feed thieves. The prerequisite is that the thief uses the same IP address for both the theft and the publication. [...]

Ich glaub, ich werd wahnsinnig! » Blog Archive » Inhalt klau verhindern

[...] but that is not what I want. On the site http://www.seogghead.com I then struck gold [...]

Ich glaub, ich werd wahnsinnig! » Blog Archive » Inhalt Klau verhindern - Anleitung

[...] instructions on the site http://www.seoegghead.com seem to have helped. Here is my short guide to what I did [...]

MAD KANE’S HUMOR BLOG » Blog Archive » Victory In My Battle Against Feed Scraping Content Thief 4Comedy.com

[...] 4. An article entitled How you can stop dirty feedscrapers in 3 easy steps; and [...]

Scraper Sites

[...] If your site or blog has been scraped and you do not want that site to continue to take your content, then I have found an easy enough solution for you on Jaime Sirovich's SEO blog. [...]

What if George Bush was an SEO?

[...] Scraping useful content from other sites is okay, but stealing it is wrong.. [...]


