Jun 9

Checking for Dead Links Automatically

Posted by Jaimie Sirovich on Jun. 9th, 2006.


This neat little class returns the HTTP status code of a URL, using cURL to fetch the headers. Simply pass the output of "getHeader" to "parseResponseCode" and see if the result is a 200. Depending on your requirements, a 302 or 301 may also be a satisfactory answer, or you may want to update the record (at least in the case of a 301), or recurse and check the redirect target. If the answer is a 404, you know you've found trouble.

It's important to check for dead links: too many of them can be detrimental to your site's ranking, not to mention annoying for the user. If it's too complicated (or perhaps impossible) to remove them automatically, simply email yourself a log with the information and take care of it manually at regular intervals. Bill Slawski mentions in his blog the (detrimental) effect of "web decay." I think this is important.
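As a rough sketch of the "email a log to yourself" idea, the snippet below builds a plain-text report from a list of URLs and status codes. The function name buildDeadLinkReport, the sample data, and the email address are all made up for illustration; the status codes are assumed to have been collected beforehand by a checker such as the class in this post.

```php
<?php
// Hypothetical helper (not part of the original class): turns a list of
// URL => status code pairs into a plain-text report that contains only
// the entries whose status is not 200.
function buildDeadLinkReport(array $results)
{
    $lines = array();
    foreach ($results as $url => $code) {
        if ((string) $code !== '200') {
            $lines[] = $code . "\t" . $url;
        }
    }
    return implode("\n", $lines);
}

// Sample data, invented for this example.
$results = array(
    'http://www.example.com/'        => '200',
    'http://www.example.com/missing' => '404',
    'http://www.example.com/moved'   => '301',
);

$report = buildDeadLinkReport($results);
echo $report . "\n";

// To email the report from a scheduled job, something like this could be
// used; the address is a placeholder.
// if ($report !== '') {
//     mail('you@example.com', 'Dead link report', $report);
// }
```

Treating everything other than a 200 as worth reporting is a deliberately cautious choice; a stricter or looser filter may suit your site better.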

Below is the code for the LinkChecker class, in PHP.

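<?php

$LINKCHECKER_total_str = '';

// +----------------------------------------------------------------------+
// | LinkChecker                                                          |
// | Gets headers using Curl                                              |
// +----------------------------------------------------------------------+
// | Copyright (c) 2003 Jaimie Sirovich                                   |
// +----------------------------------------------------------------------+
// | Author: Jaimie Sirovich <jsirovic@gmail.com>                         |
// +----------------------------------------------------------------------+

class LinkChecker
{

    // cURL write callback. Accumulates the response until the first blank
    // line (the end of the headers), echoes the headers into the output
    // buffer, then returns -1 to abort the transfer so that the body is
    // never downloaded.
    public static function CURLOPT_WRITEFUNCTION($ch, $str)
    {
        global $LINKCHECKER_total_str;
        $LINKCHECKER_total_str .= $str;
        if (preg_match('/^(.*?)\r\n\r\n/s', $LINKCHECKER_total_str, $matches)) {
            echo $matches[1];
            return -1;
        } else {
            return strlen($str);
        }
    }

    // Returns the raw response headers for $url as a string.
    public static function getHeader($url, $userAgent = "Mozilla/4.0")
    {
        global $LINKCHECKER_total_str;
        $LINKCHECKER_total_str = "";
        ob_start();
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);

        curl_setopt($ch, CURLOPT_WRITEFUNCTION, array("LinkChecker", "CURLOPT_WRITEFUNCTION"));

        $result = curl_exec($ch);
        curl_close($ch);
        return ob_get_clean();
    }

    // Extracts the three-digit status code from the status line.
    // Also accepts status lines without a minor version, e.g. "HTTP/2 200".
    public static function parseResponseCode($str)
    {
        if (preg_match("/^HTTP\/\d(?:\.\d)? (\d{3})/", $str, $matches)) {
            return $matches[1];
        }
        return null;
    }

    // Header names are case-insensitive, hence the "i" modifier below.
    public static function parseMimeType($str)
    {
        if (preg_match("/Content-type: (.*)/i", $str, $matches)) {
            return trim($matches[1]);
        }
        return null;
    }

    public static function parseContentLength($str)
    {
        if (preg_match("/Content-length: (.*)/i", $str, $matches)) {
            return trim($matches[1]);
        }
        return null;
    }

}

?>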
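Here is a minimal sketch of how the result of "parseResponseCode" might be acted upon, following the 200 / 301 / 302 / 404 cases described above. The helper classifyStatus and its return values are hypothetical and not part of the class; the URL is a placeholder. Whether a 302 counts as acceptable depends on your requirements, so treat the mapping as one possible policy.

```php
<?php
// Hypothetical helper (not part of LinkChecker): maps a status code, as
// returned by parseResponseCode, to a suggested action.
function classifyStatus($code)
{
    switch ((string) $code) {
        case '200':
            return 'ok';
        case '301':
            return 'update';  // permanent redirect: consider updating the record
        case '302':
            return 'ok';      // temporary redirect: assumed acceptable here
        case '404':
            return 'dead';
        default:
            return 'review';  // anything else is left for a human to inspect
    }
}

// Usage, assuming the LinkChecker class has been loaded and the cURL
// extension is available. Skipped otherwise.
if (class_exists('LinkChecker') && function_exists('curl_init')) {
    $checker = new LinkChecker();
    $header  = $checker->getHeader('http://www.example.com/');
    $code    = $checker->parseResponseCode($header);
    echo $code . ' => ' . classifyStatus($code) . "\n";
}
```

A link classified as "dead" or "review" could then be removed automatically or written to a log for manual follow-up.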





"2 Wise Comments Banged Out Somewhere On The Internet ..."


IncrediBILL

The only problem with basic link checking is many sites have overhauled their servers to deliver 'soft' 404 pages that actually return a 200 result.

Additionally, many sites lose their domain names and these become domain parks, pr0n sites, scraper sites and much worse that will also return 200 results.

Not that the solution is trivial: I have over 400 fingerprints that I use to identify those types of pages or their redirect servers. It's a big mess.

Just thought I'd point that out so people don't think such a simple check will catch everything because you'll most likely be linking to bad neighborhoods that respond positively to a basic bare-bones link check.

SEO Egghead » Blog Archive » XSS for Lunch - Yum!

[...] First, we create a script that utilizes the last code-snippet I posted here that parses out the response codes from a HTTP document (LinkChecker.php), located here. [...]


