I am building a basic link checker using cURL. My application has a function called getHeaders() that returns an array of HTTP headers:
function getHeaders($url) {
    if (function_exists('curl_init')) {
        // create a new cURL resource
        $ch = curl_init();
        // set URL and other appropriate options
        $options = array(
            CURLOPT_URL => $url,
            CURLOPT_HEADER => true,
            CURLOPT_NOBODY => true,
            CURLOPT_FOLLOWLOCATION => 1,
            CURLOPT_RETURNTRANSFER => true
        );
        curl_setopt_array($ch, $options);
        // grab URL and pass it to the browser
        curl_exec($ch);
        $headers = curl_getinfo($ch);
        // close cURL resource, and free up system resources
        curl_close($ch);
    } else {
        echo "Error: cURL is not installed on the web server. Unable to continue.\n";
        return false;
    }
    return $headers;
}

print_r(getHeaders('mail.google.com'));
This produces the following result:
Array
(
    [url] => http://mail.google.com
    [content_type] => text/html; charset=UTF-8
    [http_code] => 404
    [header_size] => 338
    [request_size] => 55
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.128
    [namelookup_time] => 0.042
    [connect_time] => 0.095
    [pretransfer_time] => 0.097
    [size_upload] => 0
    [size_download] => 0
    [speed_download] => 0
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 0.128
    [redirect_time] => 0
)
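I have tested it with several long links, and the function follows redirects for all of them except, it seems, mail.google.com.
For fun, I passed the same URL (mail.google.com) to the W3C Link Checker, which produced: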
Results

Links
Valid links!

List of redirects
The links below are not broken, but the document does not use the exact URL, and the links were redirected. It may be a good idea to link to the final location, for the sake of speed.

warning Line: 1 http://mail.google.com/mail/ redirected to https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&bsv=zpwhtygjntrz&scc=1&ltmpl=default&ltmplcache=2
Status: 302 -> 200 OK
This is a temporary redirect. Update the link if you believe it makes sense, or leave it as is.

Anchors
Found 0 anchors.

Checked 1 document in 4.50 seconds.
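This is correct, because when I type mail.google.com into my browser, the address above is where I am redirected.
Which cURL options do I need so that my function returns 200 for mail.google.com?
Why does the function above return a 404 status code instead of a 302?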
TIA
Answer 0 (score: 4)
The problem is that the redirect is specified by a method that cURL does not follow.
Here is the response from http://mail.google.com:
HTTP/1.1 200 OK
Cache-Control: public, max-age=604800
Expires: Mon, 22 Jun 2009 14:58:18 GMT
Date: Mon, 15 Jun 2009 14:58:18 GMT
Refresh: 0;URL=http://mail.google.com/mail/
Content-Type: text/html; charset=ISO-8859-1
X-Content-Type-Options: nosniff
Transfer-Encoding: chunked
Server: GFE/1.3
<html>
<head>
<meta http-equiv="Refresh" content="0;URL=http://mail.google.com/mail/" />
</head>
<body>
<script type="text/javascript" language="javascript">
<!--
location.replace("http://mail.google.com/mail/")
-->
</script>
</body>
</html>
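As you can see, the page uses a Refresh header (and the equivalent HTML meta tag) plus JavaScript in the body to change the location to http://mail.google.com/mail/.
If you then request http://mail.google.com/mail/, you are redirected (with a Location header, which cURL does follow) to the page you mentioned earlier, the one the W3C checker correctly identified.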
HTTP/1.1 302 Moved Temporarily
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Date: Mon, 15 Jun 2009 15:07:56 GMT
Location: https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&bsv=zpwhtygjntrz&scc=1&ltmpl=default&ltmplcache=2
Content-Type: text/html; charset=UTF-8
X-Content-Type-Options: nosniff
Transfer-Encoding: chunked
Server: GFE/1.3
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-control: no-cache, no-store
Pragma: no-cache
Expires: Mon, 01-Jan-1990 00:00:00 GMT
Set-Cookie: GALX=B8zH60M78Ys;Path=/accounts;Secure
Date: Mon, 15 Jun 2009 15:07:56 GMT
X-Content-Type-Options: nosniff
Content-Length: 19939
Server: GFE/2.0
(HTML page content here, removed)
You may want to add an extra step to your script that checks for a Refresh header.
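One way that extra step might look is sketched below. The helper name findRefreshTarget() is hypothetical, and the sketch assumes the raw response (headers, optionally followed by the body) is available as a string, for example from curl_exec() with CURLOPT_HEADER and CURLOPT_RETURNTRANSFER enabled. It only handles the simple "seconds;URL=target" form shown in the response above and does not attempt the JavaScript redirect.

```php
<?php
/**
 * Look for a refresh-style redirect in a raw HTTP response.
 * Returns the target URL, or null if none is found.
 * Hypothetical helper for illustration, not part of cURL.
 */
function findRefreshTarget(string $rawResponse): ?string
{
    // Header form, e.g. "Refresh: 0;URL=http://example.com/"
    if (preg_match('/^Refresh:\s*\d+\s*;\s*url\s*=\s*(\S+)/mi', $rawResponse, $m)) {
        return trim($m[1], "\"'");
    }

    // Meta tag form, e.g. <meta http-equiv="Refresh" content="0;URL=...">
    $metaPattern = '/<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
        . 'content\s*=\s*["\']\s*\d+\s*;\s*url\s*=\s*([^"\'>\s]+)/i';
    if (preg_match($metaPattern, $rawResponse, $m)) {
        return $m[1];
    }

    return null;
}
```

If the function returns a URL, the link checker could request that URL in a second pass and report the status of the final hop.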
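Another possible cause is that open_basedir is set in your PHP configuration, which disables CURLOPT_FOLLOWLOCATION. You can check this quickly by turning on error reporting, because the message is generated as a warning or notice.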
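A rough sketch of that check follows. The function name checkFollowLocation() is made up for this example. It turns on full error reporting, reads the open_basedir setting, and records whether curl_setopt() accepted CURLOPT_FOLLOWLOCATION (curl_setopt() returns false when an option cannot be set). Whether the option is actually refused depends on your PHP version and configuration.

```php
<?php
// Show warnings and notices while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

/**
 * Report settings that can affect CURLOPT_FOLLOWLOCATION.
 * Hypothetical helper for illustration.
 */
function checkFollowLocation(): array
{
    $report = array(
        'open_basedir'            => (string) ini_get('open_basedir'),
        'curl_available'          => function_exists('curl_init'),
        'followlocation_accepted' => null, // stays null if cURL is missing
    );

    if ($report['curl_available']) {
        $ch = curl_init();
        // false here means the option was refused
        $report['followlocation_accepted'] = curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        unset($ch);
    }

    return $report;
}

print_r(checkFollowLocation());
```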
All of the results above were obtained with the following cURL settings:
$useragent="Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$res = curl_exec($ch);
curl_close($ch);
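With CURLOPT_HEADER and CURLOPT_FOLLOWLOCATION both enabled as above, $res contains the header block of every response in the redirect chain, which is why the 302 and the 200 appear together in the output. For a link checker, it may help to pull the status codes out of that string. Below is a minimal sketch. The helper name extractStatusCodes() is hypothetical, and it assumes no line in the body happens to begin with "HTTP/".

```php
<?php
/**
 * Return the status code of each response found in a raw cURL result,
 * in the order the responses were received.
 * Hypothetical helper for illustration.
 */
function extractStatusCodes(string $rawResponse): array
{
    // Match status lines such as "HTTP/1.1 302 Moved Temporarily" or "HTTP/2 200"
    preg_match_all('/^HTTP\/\d+(?:\.\d+)?\s+(\d{3})/m', $rawResponse, $matches);

    return array_map('intval', $matches[1]);
}
```

Called on the second response shown earlier, this would give the same "302 -> 200" chain that the W3C checker reports.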
Answer 1 (score: 0)
It could be that
mail.google.com -> mail.google.com/mail is a 404 and then a hard redirect
and
mail.google.com/mail -> https://www.google.com/accounts... etc is a 302 redirect