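I'm teaching myself some basic scraping, and I've found that sometimes the URLs I feed into my code return a 404, which gums up all the rest of my code.
So I need a test at the top of the code to check whether the URL returns a 404 or not.
This seems like a pretty straightforward task, but Google isn't giving me any answers. I worry I'm searching for the wrong thing.
One blog recommended I use this: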
$valid = @fsockopen($url, 80, $errno, $errstr, 30);
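And then test to see whether $valid is empty or not.
But I think the URL that's giving me problems has a redirect on it, so $valid comes up empty for all values. Or perhaps I'm doing something else wrong.
I've also looked into a "HEAD request", but I have yet to find any actual code examples I can play with or try out.
Suggestions? And what's this about curl?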
If you are using PHP's curl bindings, you can check the error code using curl_getinfo, like this:
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

    /* Get the HTML or whatever is linked in $url. */
    $response = curl_exec($handle);

    /* Check for 404 (file not found). */
    $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    if($httpCode == 404) {
        /* Handle 404 here. */
    }

    curl_close($handle);

    /* Handle $response here. */
If you're running PHP 5 you can use:

    $url = 'http://www.example.com';
    print_r(get_headers($url, 1));

Or, with PHP 4, a user has contributed the following:
    /**
     * This is a modified version of code from "stuart at sixletterwords dot com", at 14-Sep-2005 04:52.
     * This version tries to emulate get_headers() function at PHP4.
     * I think it works fairly well, and is simple.
     * It is not the best emulation available, but it works.
     *
     * Features:
     * - supports (and requires) full URLs.
     * - supports changing of default port in URL.
     * - stops downloading from socket as soon as end-of-headers is detected.
     *
     * Limitations:
     * - only gets the root URL (see line with "GET / HTTP/1.1").
     * - don't support HTTPS (nor the default HTTPS port).
     */
    if(!function_exists('get_headers'))
    {
        function get_headers($url,$format=0)
        {
            $url=parse_url($url);
            $end = "\r\n\r\n";
            $fp = fsockopen($url['host'], (empty($url['port'])?80:$url['port']), $errno, $errstr, 30);
            if ($fp)
            {
                $out  = "GET / HTTP/1.1\r\n";
                $out .= "Host: ".$url['host']."\r\n";
                $out .= "Connection: Close\r\n\r\n";
                $var  = '';
                fwrite($fp, $out);
                while (!feof($fp))
                {
                    $var.=fgets($fp, 1280);
                    if(strpos($var,$end))
                        break;
                }
                fclose($fp);

                $var=preg_replace("/\r\n\r\n.*\$/",'',$var);
                $var=explode("\r\n",$var);
                if($format)
                {
                    $v = array(); // initialise so an empty result is still an array
                    foreach($var as $i)
                    {
                        if(preg_match('/^([a-zA-Z -]+): +(.*)$/',$i,$parts))
                            $v[$parts[1]]=$parts[2];
                    }
                    return $v;
                }
                else
                    return $var;
            }
        }
    }
Both would produce a result similar to:

    Array
    (
        [0] => HTTP/1.1 200 OK
        [Date] => Sat, 29 May 2004 12:28:14 GMT
        [Server] => Apache/1.3.27 (Unix) (Red-Hat/Linux)
        [Last-Modified] => Wed, 08 Jan 2003 23:11:55 GMT
        [ETag] => "3f80f-1b6-3e1cb03b"
        [Accept-Ranges] => bytes
        [Content-Length] => 438
        [Connection] => close
        [Content-Type] => text/html
    )

Therefore you can simply check that the header response is OK, e.g.:
    $headers = get_headers($url, 1);
    if ($headers[0] == 'HTTP/1.1 200 OK') {
        //valid
    }

    if ($headers[0] == 'HTTP/1.1 301 Moved Permanently') {
        //moved or redirect page
    }
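W3C Codes and Definitions
With strager's code, you can also check CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404; instead they simply redirect to a custom 404 page and return a 302 (redirect) or something similar. I used this to check whether an actual file (e.g. robots.txt) existed on the server. Clearly this kind of file would not cause a redirect if it existed, but if it didn't, the request would be redirected to a 404 page which, as I said before, may not carry a 404 code.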
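Comparing the whole status line is brittle, since the protocol version ("HTTP/1.0", "HTTP/2") and the reason phrase vary between servers. Here is a minimal sketch that pulls out just the numeric code instead. The helper names status_code_from_line and final_status_code are made up for illustration, and the sketch assumes that when redirects are followed, get_headers($url, 1) stores each hop's status line under the integer keys 0, 1, 2, ...

```php
<?php
/**
 * Extract the numeric status code from an HTTP status line such as
 * "HTTP/1.1 404 Not Found". Returns null if the line is not a status line.
 * (Hypothetical helper, not part of PHP.)
 */
function status_code_from_line($line)
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $line, $m)) {
        return (int) $m[1];
    }
    return null;
}

/**
 * Given an array shaped like the result of get_headers($url, 1), return the
 * status code of the last response in it, or null if none is found.
 * Assumption: status lines sit under integer keys, one per redirect hop.
 */
function final_status_code(array $headers)
{
    $code = null;
    foreach ($headers as $key => $value) {
        if (is_int($key) && is_string($value)) {
            $parsed = status_code_from_line($value);
            if ($parsed !== null) {
                $code = $parsed; // keep overwriting so the last hop wins
            }
        }
    }
    return $code;
}

// Offline demonstration with a hand-written header array.
$sample = array(
    0          => 'HTTP/1.1 301 Moved Permanently',
    'Location' => 'http://www.example.com/new',
    1          => 'HTTP/1.1 404 Not Found',
);
var_dump(final_status_code($sample)); // int(404)
```

Against a live URL you would pass in the result of get_headers($url, 1), after checking that it is not false.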
    function is_404($url) {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

        /* Get the HTML or whatever is linked in $url. */
        $response = curl_exec($handle);

        /* Check for 404 (file not found). */
        $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
        curl_close($handle);

        /* If the document has loaded successfully without any redirection or error */
        if ($httpCode >= 200 && $httpCode < 300) {
            return false;
        } else {
            return true;
        }
    }
As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you only need the headers).
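For what it's worth, here is a rough sketch of that header-only check. The function name head_status_code, the 10-second timeout and the limit of 5 redirects are my own assumptions, not anything from the answers above. It also follows redirects, so the code you get back is the one from the final hop.

```php
<?php
/**
 * Return the HTTP status code for $url using a header-only (HEAD) request,
 * or null if the request itself failed (DNS error, timeout, ...).
 * Hypothetical helper; timeout and redirect limit are arbitrary choices.
 */
function head_status_code($url, $timeoutSeconds = 10)
{
    $handle = curl_init($url);
    if ($handle === false) {
        return null;
    }

    curl_setopt($handle, CURLOPT_NOBODY, true);         // ask for headers only
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // do not echo anything
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true); // report the final hop
    curl_setopt($handle, CURLOPT_MAXREDIRS, 5);
    curl_setopt($handle, CURLOPT_TIMEOUT, $timeoutSeconds);

    $ok = curl_exec($handle);
    if ($ok === false) {
        return null; // transport-level failure, no HTTP status available
    }

    // The handle is released when it goes out of scope at the end of the function.
    return (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
}
```

Usage would look like `if (head_status_code($url) === 404) { /* skip this URL */ }`. One caveat: some servers answer HEAD requests differently from GET requests, so if the results look odd, fall back to a normal GET.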
If you are looking for the simplest solution, one you can try in one go on PHP 5, do:

    file_get_contents('http://www.yoursite.com');
    // and check by echoing
    echo $http_response_header[0];

I found this answer here:
    if(($twitter_XML_raw=file_get_contents($timeline))===false){
        // Retrieve HTTP status code
        list($version,$status_code,$msg) = explode(' ',$http_response_header[0], 3);

        // Check the HTTP Status code
        switch($status_code) {
            case 200:
                $error_status="200: Success";
                break;
            case 401:
                $error_status="401: Login failure. Try logging out and back in. Password are ONLY used when posting.";
                break;
            case 400:
                $error_status="400: Invalid request. You may have exceeded your rate limit.";
                break;
            case 404:
                $error_status="404: Not found. This shouldn't happen. Please let me know what happened using the feedback link above.";
                break;
            case 500:
                $error_status="500: Twitter servers replied with an error. Hopefully they'll be OK soon!";
                break;
            case 502:
                $error_status="502: Twitter servers may be down or being upgraded. Hopefully they'll be OK soon!";
                break;
            case 503:
                $error_status="503: Twitter service unavailable. Hopefully they'll be OK soon!";
                break;
            default:
                $error_status="Undocumented error: " . $status_code;
                break;
        }
    }
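Essentially, you use the file_get_contents function to retrieve the URL, which automatically populates the $http_response_header variable with the response headers, including the status line.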
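One thing to watch for: when the request is redirected, $http_response_header holds the headers of every hop, so element 0 is the status of the first response, not the last. Below is a minimal sketch that scans for the last status line. The names last_status_from_lines and fetch_with_status are invented for this example, and the 10-second timeout is an arbitrary choice.

```php
<?php
/**
 * Scan raw header lines (the shape of $http_response_header) from the end
 * and return the status code of the most recent response, or null.
 * Hypothetical helper, not part of PHP.
 */
function last_status_from_lines(array $lines)
{
    $lines = array_values($lines);
    for ($i = count($lines) - 1; $i >= 0; $i--) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $lines[$i], $m)) {
            return (int) $m[1];
        }
    }
    return null;
}

/**
 * Fetch $url and return array(body, status code).
 * With ignore_errors enabled, the body of a 4xx/5xx response is returned
 * instead of false plus a warning.
 */
function fetch_with_status($url)
{
    $context = stream_context_create(array(
        'http' => array('ignore_errors' => true, 'timeout' => 10),
    ));
    $body = @file_get_contents($url, false, $context);

    // The HTTP wrapper creates $http_response_header in the calling scope,
    // which here is this function.
    $lines = isset($http_response_header) ? $http_response_header : array();

    return array($body, last_status_from_lines($lines));
}
```

You could then write `list($body, $status) = fetch_with_status($url);` and bail out early when $status is 404 or null.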