- Jun. 9th, 2006
- 2 comments
This neat little class returns the HTTP status code of a URL, using cURL to fetch the headers. Pass the output of "getHeader" to "parseResponseCode" and check whether the result is 200. Depending on your requirements, a 301 or 302 may also be acceptable; alternatively, you may want to update the stored URL (at least for a 301, which is a permanent redirect) or check the redirect target in turn. If the answer is a 404, you know you've found trouble.
It's important to check for dead links: too many of them can hurt your site's ranking, and they annoy users. If removing them automatically is too complicated (or perhaps impossible), simply email yourself a log of the dead links and take care of them manually at regular intervals. Bill Slawski discusses the detrimental effect of "web decay" on his blog, and I think it is worth taking seriously.
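As a rough illustration of the "email yourself a log" approach, the sketch below builds a plain-text report from a list of URLs and their status codes. The function name `buildDeadLinkReport`, the set of codes treated as acceptable, the sample URLs and the recipient address are all assumptions made for illustration; none of them are part of the class.

```php
<?php
// Hypothetical helper: turns [url => status code] into a plain-text report
// listing every link whose code is not in $okCodes.
function buildDeadLinkReport(array $results, array $okCodes = array(200, 301, 302))
{
    $lines = array();
    foreach ($results as $url => $code) {
        // A null code means no status line could be parsed (e.g. a timeout).
        if ($code === null || !in_array((int) $code, $okCodes, true)) {
            $lines[] = sprintf('%s  %s', $code === null ? '---' : $code, $url);
        }
    }
    return implode("\n", $lines);
}

// Sample data; in practice the codes would come from parseResponseCode().
$report = buildDeadLinkReport(array(
    'http://example.com/'        => '200',
    'http://example.com/moved'   => '301',
    'http://example.com/missing' => '404',
    'http://example.com/timeout' => null,
));

echo $report, "\n";

// To mail the report (the address is a placeholder):
// if ($report !== '') { mail('you@example.com', 'Dead link report', $report); }
```

Keeping the report-building step separate from the mailing step makes it easy to run the check from a scheduled job and only send mail when the report is non-empty.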
The code for the class, in PHP, is below.
<?php
$LINKCHECKER_total_str = '';
// +----------------------------------------------+
// | LinkChecker                                  |
// | Gets headers using Curl                      |
// +----------------------------------------------+
// | Copyright (c) 2003 Jaimie Sirovich           |
// +----------------------------------------------+
// | Author: Jaimie Sirovich <jsirovic@gmail.com> |
// +----------------------------------------------+
class LinkChecker
{
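    // Write callback for cURL. Accumulates the response and aborts the
    // transfer as soon as the header block (terminated by a blank line)
    // is complete, so the body is never downloaded.
    public static function CURLOPT_WRITEFUNCTION($ch, $str)
    {
        global $LINKCHECKER_total_str;
        $LINKCHECKER_total_str .= $str;
        if (preg_match('/^(.*?)\r\n\r\n/s', $LINKCHECKER_total_str, $matches)) {
            echo $matches[1];
            return -1; // a length mismatch makes cURL stop the transfer
        } else {
            return strlen($str);
        }
    }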
    // Requests $url and returns the header block of the first response.
    public static function getHeader($url, $userAgent = "Mozilla/4.0")
    {
        global $LINKCHECKER_total_str;
        $LINKCHECKER_total_str = "";
        ob_start(); // the write callback echoes the headers into this buffer
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_WRITEFUNCTION, array("LinkChecker", "CURLOPT_WRITEFUNCTION"));
        curl_exec($ch);
        curl_close($ch);
        return ob_get_clean();
    }
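    // Returns the three-digit status code from the status line, or null
    // if none is found. Also accepts status lines such as "HTTP/2 200".
    public static function parseResponseCode($str)
    {
        return preg_match('/^HTTP\/\d(?:\.\d)? (\d{3})/', $str, $matches) ? $matches[1] : null;
    }

    // Header names are case-insensitive, hence the /i flag; /m anchors the
    // match to the start of a header line. Returns null if the header is absent.
    public static function parseMimeType($str)
    {
        return preg_match('/^Content-Type:[ \t]*(.*)$/im', $str, $matches) ? trim($matches[1]) : null;
    }

    public static function parseContentLength($str)
    {
        return preg_match('/^Content-Length:[ \t]*(.*)$/im', $str, $matches) ? trim($matches[1]) : null;
    }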
}
?>
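One possible way to act on the status code is sketched below, following the cases described at the top of the post. The `classifyStatus` function and its labels are assumptions made for illustration, and the URL in the commented-out lines is a placeholder.

```php
<?php
// Hypothetical helper: maps a status code to the action suggested in the post.
function classifyStatus($code)
{
    switch ((int) $code) {
        case 200:
            return 'ok';
        case 301:
            return 'update';       // permanent redirect: update the stored URL
        case 302:
            return 'ok-temporary'; // temporary redirect: usually acceptable
        case 404:
            return 'dead';
        default:
            return 'review';       // anything else: inspect manually
    }
}

echo classifyStatus('404'), "\n"; // prints "dead"

// With the class loaded, a live check might look like this:
// $checker = new LinkChecker();
// $header  = $checker->getHeader('http://example.com/');
// echo classifyStatus($checker->parseResponseCode($header)), "\n";
```

Anything that falls into the 'review' bucket (server errors, timeouts, unusual redirects) is probably best handled by hand rather than removed automatically.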
"2 Wise Comments Banged Out Somewhere On The Internet ..."
The only problem with basic link checking is that many sites have overhauled their servers to deliver 'soft' 404 pages that actually return a 200 result. Additionally, many sites lose their domain names, and these become domain parks, pr0n sites, scraper sites and much worse, all of which will also return 200 results. Not that the solution is trivial: I have over 400 fingerprints that I use to identify those types of pages or their redirect servers, and it's a big mess. Just thought I'd point that out so people don't think such a simple check will catch everything, because you'll most likely be linking to bad neighborhoods that respond positively to a basic bare-bones link check.

SEO Egghead » Blog Archive » XSS for Lunch - Yum! [...] First, we create a script that utilizes the last code-snippet I posted here that parses out the response codes from a HTTP document (LinkChecker.php), located here. [...]