Question

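I'm writing a website in PHP that aggregates data from various other websites. I have a function, 'returnPageSource', which takes a URL and returns the HTML from that URL as a string.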

function returnPageSource($url){
    $ch = curl_init();
    $timeout = 5;   // set to zero for no timeout       

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // means the page is returned
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOUT_CONNECTTIMEOUT, $timeout); // how long to wait to connect
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);     // follow redirects
    //curl_setopt($ch, CURLOPT_HEADER, False);          // only request body

    $fileContents = curl_exec($ch); // $fileContents contains the html source of the required website
    curl_close($ch);

    return $fileContents;
}

This works for some of the sites I need, such as http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310, but not for others, such as http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0. Does anyone know why?

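Update

Thanks for the replies. I've changed my user agent to match my browser's (Firefox 3, which can access the sites fine) and changed the timeout to 0. I still can't connect, but I can now get some error messages. curl_error() gives the error "couldn't connect to host", and curl_getinfo($ch, CURLINFO_HTTP_CODE); returns HTTP code 0, neither of which is much help. I also tried curl_setopt($ch, CURLOPT_VERBOSE, 1); but nothing was displayed. Does anyone have any other ideas?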
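One likely reason CURLOPT_VERBOSE showed nothing: cURL writes its verbose log to stderr unless CURLOPT_STDERR points somewhere else, and stderr usually does not end up in the page when the script runs under a web server. Below is a minimal sketch of capturing that log into a string. The helper name fetchWithVerboseLog is made up for illustration and is not part of the original code.

```php
<?php
// Hypothetical helper (an assumption, not from the original code):
// fetches $url and returns the body, the error text and cURL's verbose log.
function fetchWithVerboseLog($url)
{
    // In-memory stream that receives the verbose output instead of stderr.
    $log = fopen('php://temp', 'w+');

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_STDERR, $log);

    $body = curl_exec($ch);   // false on failure
    $error = curl_error($ch); // empty string when nothing went wrong

    // Read back everything cURL wrote to the log stream.
    rewind($log);
    $verbose = stream_get_contents($log);

    return array('body' => $body, 'error' => $error, 'verbose' => $verbose);
}
```

Calling fetchWithVerboseLog($url) and printing the 'verbose' entry should show the connection attempts that cURL made, which can help narrow down a "couldn't connect to host" failure.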

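Final update

I just realised I never explained what was wrong: I simply needed to enter my university's proxy settings (I'm using the university's server). After that, everything worked fine!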
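For anyone hitting the same problem, here is a rough sketch of pointing cURL at a proxy. The function names are made up for illustration, and the proxy host, port and credentials are placeholders, since the actual university settings are not given here.

```php
<?php
// Builds the cURL options for a request sent through a proxy.
// All proxy values are supplied by the caller; none are real settings.
function proxyOptions($url, $proxyHost, $proxyPort, $proxyUserPwd = null)
{
    $options = array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_PROXY          => $proxyHost,
        CURLOPT_PROXYPORT      => $proxyPort,
    );
    if ($proxyUserPwd !== null) {
        // Only needed when the proxy asks for a login, as "user:password".
        $options[CURLOPT_PROXYUSERPWD] = $proxyUserPwd;
    }
    return $options;
}

// Hypothetical proxy-aware variant of returnPageSource.
// Returns the HTML as a string, or false if the request failed.
function returnPageSourceViaProxy($url, $proxyHost, $proxyPort, $proxyUserPwd = null)
{
    $ch = curl_init();
    curl_setopt_array($ch, proxyOptions($url, $proxyHost, $proxyPort, $proxyUserPwd));
    return curl_exec($ch);
}
```

A call would look like returnPageSourceViaProxy($url, 'proxy.example.edu', 8080), where 'proxy.example.edu' and 8080 stand in for whatever the network administrator provides.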

Answer 1

You should use curl_error() to check which error occurred (if any).
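A minimal sketch of what that check could look like. The function name fetchOrFail is invented for this example. curl_exec() returns false on failure when CURLOPT_RETURNTRANSFER is set, and curl_errno() and curl_error() then describe what went wrong.

```php
<?php
// Hypothetical helper: returns the page body, or throws an exception
// carrying cURL's own error number and message when the request fails.
function fetchOrFail($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);

    $body = curl_exec($ch);
    if ($body === false) {
        // Read the error details before the handle goes away.
        throw new RuntimeException(
            sprintf('cURL error %d: %s', curl_errno($ch), curl_error($ch))
        );
    }
    return $body;
}
```

Wrapping the call in try/catch and logging the exception message makes a silent failure visible instead of returning an empty page.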

Answer 2

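I assume you've tried setting the timeout to 0.

What HTTP status codes do these sites return? Check curl_getinfo($ch, CURLINFO_HTTP_CODE);.

Another thing to try is spoofing the User-Agent header, perhaps with your own browser's, since you know it can access the pages. They may simply be trying to keep bots off the page.

Investigating the headers and HTTP codes should give you more information.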
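Something along these lines would cover both suggestions in one go. It is only a sketch: fetchWithStatus is a made-up name, and the User-Agent string is whatever your own browser sends. A status of 0 means no HTTP response arrived at all, so the problem lies in connecting rather than in anything the site sent back.

```php
<?php
// Hypothetical helper: sends a browser-like User-Agent and reports the
// HTTP status code alongside the body and any cURL error text.
function fetchWithStatus($url, $userAgent)
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_USERAGENT      => $userAgent, // copied from a real browser
    ));

    $body = curl_exec($ch); // false on failure

    return array(
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no response
        'body'   => $body,
        'error'  => curl_error($ch),
    );
}
```

For example, fetchWithStatus($url, $_SERVER['HTTP_USER_AGENT']) would reuse the agent string of the browser making the current request, assuming the script runs behind a web server.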

Edit

I looked into this a bit more. One thing: your connection timeout option is misspelled. It should be CURLOPT_CONNECTTIMEOUT.
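For reference, a sketch of the question's function with the constant name corrected. On PHP versions before 8 an undefined constant such as CURLOUT_CONNECTTIMEOUT is treated as a string and only produces a notice or warning, so the connect timeout was most likely never applied; PHP 8 raises an Error instead.

```php
<?php
// The function from the question with the option name spelled correctly.
function returnPageSource($url)
{
    $ch = curl_init();
    $timeout = 5; // seconds to wait while connecting

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the page as a string
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // was CURLOUT_CONNECTTIMEOUT
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects

    // HTML source of the page, or false if the request failed.
    // The handle is released automatically when $ch goes out of scope.
    return curl_exec($ch);
}
```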

Anyway, I ran this script (below) and it returned what you want (I think). Check what differs between it and yours. In case it's useful, I'm using PHP 5.2.8.

<?php
$addresses = array(
    'http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310',
    'http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0'
);

foreach ($addresses as $address) {
    echo "Address: http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0\n";
    // This box doesn't have http registered as a transport layer - pfft
    //var_dump(fsockopen($address, 80));

    $ch = curl_init($address);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);

    $fc = curl_exec($ch);
    echo "Info: " . print_r(curl_getinfo($ch), true) . "\n";
    echo "$fc\n";
    curl_close($ch);
}

It returned the following (TL;DR: my cURL can read the pages fine):

C:\Users\Ross>php -e D:\sandbox\curl.php
Address: http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0
Info: Array
(
    [url] => http://atensembl.arabidopsis.info/Arabidopsis_thaliana_TAIR/unisearch?species=Arabidopsis_thaliana_TAIR;idx=;q=At5g02310
    [content_type] => text/html; charset=ISO-8859-1
    [http_code] => 200
    [header_size] => 168
    [request_size] => 151
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 0.654
    [namelookup_time] => 0.004
    [connect_time] => 0.044
    [pretransfer_time] => 0.044
    [size_upload] => 0
    [size_download] => 7531
    [speed_download] => 11515
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 0.57
    [redirect_time] => 0
)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-gb" lang="en-gb">
<head>
<title>AtEnsembl release 49: Arabidopsis thaliana TAIR EnsEMBL UniSearch results</title>
<style type="text/css" media="all">
@import url(/css/ensembl.css);
@import url(/css/content.css);
</style>
<style type="text/css" media="print">
@import url(/css/printer-styles.css);
</style>
<style type="text/css" media="screen">
@import url(/css/screen-styles.css);
</style>
<script type="text/javascript" src="/js/protopacked.js"></script>
<script type="text/javascript" src="/js/core42.js"></script>

</body>
</html>
Address: http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0
Info: Array
(
    [url] => http://www.bar.utoronto.ca/efp/cgi-bin/efpWeb.cgi?dataSource=Chemical&modeInput=Absolute&primaryGene=At5g02310&orthoListOn=0
    [content_type] => text/html; charset=UTF-8
    [http_code] => 200
    [header_size] => 146
    [request_size] => 155
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 2.695
    [namelookup_time] => 0.004
    [connect_time] => 0.131
    [pretransfer_time] => 0.131
    [size_upload] => 0
    [size_download] => 14156
    [speed_download] => 5252
    [speed_upload] => 0
    [download_content_length] => 0
    [upload_content_length] => 0
    [starttransfer_time] => 2.306
    [redirect_time] => 0
)

<html>
<head>
<title>Arabidopsis eFP Browser</title>
<link rel="stylesheet" type="text/css" href="efp.css"/>
<link rel="stylesheet" type="text/css" href="domcollapse.css"/>
<script type="text/javascript" src="efp.js"></script>
<script type="text/javascript" src="domcollapse.js"></script>
</head>
<body>

</body>
</html>

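So what does this mean? I'm not entirely sure. I doubt they're blocking you specifically (since you can access the page, unless you're running this script on a web server). Try running my code above; if it works, try commenting out parts of it to see what makes a difference (and might be causing the failure). Also, what version of PHP are you running?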

Answer 3

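Two things to consider.

First, your timeout is set low. Requests to these sites may take longer than 5 seconds.

Second, the site in question may be deliberately blocking your requests. They may have a rule that blocks requests from curl, or they may have noticed suspicious activity from your IP address (screen scraping, or network abuse by someone else) and blocked or throttled the requests.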
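To tell a slow site from a blocked one, it can help to set the two timeouts separately and check whether the failure was actually a timeout. CURLOPT_CONNECTTIMEOUT limits only the connection phase, while CURLOPT_TIMEOUT limits the whole transfer. The sketch below assumes a made-up helper name, fetchWithTimeouts, and relies on libcurl's error number 28 meaning "operation timed out".

```php
<?php
// Hypothetical helper: applies separate connect and total timeouts and
// reports whether a failure was a timeout, plus the timings cURL measured.
function fetchWithTimeouts($url, $connectSeconds, $totalSeconds)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connectSeconds); // connect phase only
    curl_setopt($ch, CURLOPT_TIMEOUT, $totalSeconds);          // whole transfer

    $body = curl_exec($ch); // false on failure
    $info = curl_getinfo($ch);

    return array(
        'body'         => $body,
        'timed_out'    => curl_errno($ch) === 28, // 28 = operation timed out
        'connect_time' => $info['connect_time'],
        'total_time'   => $info['total_time'],
    );
}
```

If 'timed_out' is true, raising the limits is worth trying; if the request fails quickly without a timeout, blocking or a network issue is the more likely cause.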

Why does this function using cURL work for some URLs but not others?
