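Confused by mail.google.com, cURL and http://validator.w3.org/checklink

Time: 2009-06-13 18:44:27

Tags: php http curl

I am building a basic link checker using cURL. My application has a function called getHeaders() that returns an array of HTTP header information: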

function getHeaders($url) {

    if(function_exists('curl_init')) {
        // create a new cURL resource
        $ch = curl_init();
        // set URL and other appropriate options
        $options = array(
            CURLOPT_URL => $url,
            CURLOPT_HEADER => true,
            CURLOPT_NOBODY => true,
            CURLOPT_FOLLOWLOCATION => 1,
            CURLOPT_RETURNTRANSFER => true );
        curl_setopt_array($ch, $options);
        // perform the request (output is returned, not printed, because of CURLOPT_RETURNTRANSFER)
        curl_exec($ch);
        $headers = curl_getinfo($ch);
        // close cURL resource, and free up system resources
        curl_close($ch);
    } else {
        echo "Error: cURL is not installed on the web server. Unable to continue.";
        return false;
    }
    return $headers;
}

print_r(getHeaders('mail.google.com'));

This produces the following result:

Array
(
    [url] => http://mail.google.com
    [content_type] => text/html; charset=UTF-8
    [http_code] => 404
    [header_size] => 338
    [request_size] => 55
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.128
    [namelookup_time] => 0.042
    [connect_time] => 0.095
    [pretransfer_time] => 0.097
    [size_upload] => 0
    [size_download] => 0
    [speed_download] => 0
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 0.128
    [redirect_time] => 0
)

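I have tested it with several long links and the function acknowledges the redirects, apart from mail.google.com, it seems.

For fun, I passed the same URL (mail.google.com) to the W3C Link Checker, which produced: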

Results

Links

Valid links!

List of redirects

The links below are not broken, but the document does not use the exact URL, and the links were redirected. It may be a good idea to link to the final location, for the sake of speed.

warning Line: 1 http://mail.google.com/mail/ redirected to

https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&bsv=zpwhtygjntrz&scc=1&ltmpl=default&ltmplcache=2

Status: 302 -> 200 OK

This is a temporary redirect. Update the link if you believe it makes sense, or leave it as is. 

Anchors

Found 0 anchors.

Checked 1 document in 4.50 seconds.

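This is correct, because the address above is where I am redirected when I type mail.google.com into my browser.

Which cURL options do I need to use to make my function return 200 for mail.google.com?

Why does the function above return a 404 status code rather than a 302?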

TIA

2 answers:

Answer 0 (score: 4)

The problem is that the redirect is specified by methods that cURL will not follow.

Here is the response from http://mail.google.com:

HTTP/1.1 200 OK
Cache-Control: public, max-age=604800
Expires: Mon, 22 Jun 2009 14:58:18 GMT
Date: Mon, 15 Jun 2009 14:58:18 GMT
Refresh: 0;URL=http://mail.google.com/mail/
Content-Type: text/html; charset=ISO-8859-1
X-Content-Type-Options: nosniff
Transfer-Encoding: chunked
Server: GFE/1.3

<html>
 <head>
  <meta http-equiv="Refresh" content="0;URL=http://mail.google.com/mail/" />
 </head>
 <body>
  <script type="text/javascript" language="javascript">
  <!--
   location.replace("http://mail.google.com/mail/")
  -->
  </script>
 </body>
</html>

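As you can see, the page uses a Refresh header (and the equivalent HTML meta tag) as well as JavaScript in the body to change the location to http://mail.google.com/mail/.

If you then request http://mail.google.com/mail/, you are redirected (using a Location header, which cURL does follow) to the page you mentioned earlier, the one the W3C checker correctly identified.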

HTTP/1.1 302 Moved Temporarily
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Fri, 01 Jan 1990 00:00:00 GMT
Date: Mon, 15 Jun 2009 15:07:56 GMT
Location: https://www.google.com/accounts/ServiceLogin?service=mail&passive=true&rm=false&continue=http%3A%2F%2Fmail.google.com%2Fmail%2F%3Fui%3Dhtml%26zy%3Dl&bsv=zpwhtygjntrz&scc=1&ltmpl=default&ltmplcache=2
Content-Type: text/html; charset=UTF-8
X-Content-Type-Options: nosniff
Transfer-Encoding: chunked
Server: GFE/1.3

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-control: no-cache, no-store
Pragma: no-cache
Expires: Mon, 01-Jan-1990 00:00:00 GMT
Set-Cookie: GALX=B8zH60M78Ys;Path=/accounts;Secure
Date: Mon, 15 Jun 2009 15:07:56 GMT
X-Content-Type-Options: nosniff
Content-Length: 19939
Server: GFE/2.0

(HTML page content here, removed)

Perhaps you should add an extra step to your script that checks for a Refresh header.
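As a minimal sketch of that extra step, the helper below pulls the target URL out of a Refresh header in a raw header block. The name findRefreshUrl() is hypothetical (it is not part of cURL or of the code in the question), and it assumes the header has the simple "delay;URL=target" shape shown in the response above.

```php
<?php
/**
 * Hypothetical helper: find the target of a Refresh header.
 *
 * $rawHeaders is the header text returned by curl_exec() when
 * CURLOPT_HEADER and CURLOPT_RETURNTRANSFER are both enabled.
 * Returns the target URL, or null when no Refresh header is present.
 */
function findRefreshUrl($rawHeaders) {
    // Matches lines such as "Refresh: 0;URL=http://mail.google.com/mail/".
    // "m" lets ^ match at the start of every line, "i" ignores case.
    if (preg_match('/^Refresh:\s*\d+\s*;\s*url\s*=\s*(\S+)/mi', $rawHeaders, $m)) {
        // Strip optional quotes around the URL.
        return trim($m[1], "\"'");
    }
    return null;
}
```

If the helper returns a URL, the link checker could request that URL in turn and report its status instead. When CURLOPT_FOLLOWLOCATION is on, the header text may contain several header blocks; this sketch simply returns the first Refresh target it finds.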

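Another possible cause is that open_basedir is set in your PHP configuration, which disables CURLOPT_FOLLOWLOCATION. You can check this quickly by turning on error reporting, because a message is generated as either a warning or a notice.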
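A rough way to run that check, assuming a PHP version where the open_basedir restriction described above still applies: turn on full error reporting and look at the open_basedir setting directly. The function name followLocationMayBeBlocked() is made up for this sketch.

```php
<?php
// Show every warning and notice, so a rejected cURL option becomes visible.
error_reporting(E_ALL);
ini_set('display_errors', '1');

/**
 * Hypothetical helper: report whether open_basedir is set, which is the
 * condition the answer says disables CURLOPT_FOLLOWLOCATION.
 */
function followLocationMayBeBlocked() {
    $openBasedir = ini_get('open_basedir');
    // ini_get() returns an empty string when the directive is not set.
    return $openBasedir !== false && $openBasedir !== '';
}
```

Independently of this, curl_setopt() returns false when an option could not be set, so checking the return value of the CURLOPT_FOLLOWLOCATION call is another quick test.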

All of the results above were obtained with the following cURL setup:

$useragent="Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

$res = curl_exec($ch);

curl_close($ch);

Answer 1 (score: 0)

It could be that

mail.google.com -> mail.google.com/mail is a 404 and then a hard redirect

mail.google.com/mail -> https://www.google.com/accounts... etc is a 302 redirect