PHP cURL retrieves only part of the page

Date: 2015-06-30 12:42:44

Tags: php html curl web-crawler

I have the following cURL code:

$ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $url);
if ($postParameters != '') {
    curl_setopt($ch, CURLOPT_POST, TRUE);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postParameters);
}
curl_setopt($ch, CURLOPT_COOKIEFILE, __DIR__.'/cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEJAR, __DIR__.'/cookie.txt');
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); 
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_TIMEOUT, 60); 
curl_setopt($ch, CURLOPT_REFERER, $referer);
$pageResponse = curl_exec($ch); 
curl_close($ch); 

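When I try to fetch a page, most of the time I receive the entire page I requested. However, every now and then I get only part of the page, for example: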

  

DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en"> head> meta http-equiv="Content-Type" content="text/html; charset=windows-1251" /> meta name="generator" content="

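I removed the "<" in front of each tag so that the HTML would display on Stack Exchange. Does anyone know why the response suddenly stops arriving? I have noticed that the data often stops right after an opening double quote (i.e. content=" or username="), though I am not 100% sure it always happens that way. Could this be an encoding problem? Any other ideas?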

Any help would be greatly appreciated.

1 Answer:

Answer 0: (score: 0)

You can try to add some debugging.

Add these options:

curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_STDERR, $f = fopen(__DIR__ . "/error.log", "w+"));

And these before curl_close():

if($errno = curl_errno($ch)) {
    $error_message = curl_strerror($errno);
    echo "cURL error ({$errno}):\n {$error_message}";
}
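
As a small extension of the check above, the error number can be turned into a more specific hint. This is only a sketch: explainCurlError is a hypothetical helper, not part of the cURL extension, and the two codes it singles out (18 and 28) are assumed to be libcurl's CURLE_PARTIAL_FILE and CURLE_OPERATION_TIMEDOUT values.

```php
<?php
// Hypothetical helper (not part of ext/curl): turn a cURL error number
// into a short hint about why a response may have been cut off.
function explainCurlError(int $errno, string $message): string
{
    switch ($errno) {
        case 0:
            return 'no cURL error reported';
        case 18: // libcurl's CURLE_PARTIAL_FILE
            return "partial transfer ({$errno}): {$message}";
        case 28: // libcurl's CURLE_OPERATION_TIMEDOUT
            return "timeout ({$errno}): {$message}";
        default:
            return "cURL error ({$errno}): {$message}";
    }
}

// Possible usage, before curl_close($ch):
// echo explainCurlError(curl_errno($ch), curl_error($ch)), "\n";
```

If this reports no error even though the page is cut short, the truncation is probably not something cURL itself detected, and the verbose log becomes the more useful source.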

If that doesn't reveal anything, try increasing the timeout and see whether the problem goes away:

curl_setopt($ch, CURLOPT_TIMEOUT, 300); 

If increasing the timeout fixes it, the next step is to find out why the transfer takes that long.
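
While the cause is being investigated, one possible stopgap is to detect a truncated page and fetch it again. The sketch below is an assumption-laden illustration rather than a fix: looksComplete and fetchWithRetry are hypothetical names, the completeness test simply looks for a closing </html> tag (which assumes the target pages are full HTML documents), and the actual cURL request is passed in as a callable so the retry logic stays independent of it.

```php
<?php
// Heuristic: treat a page as complete only if it contains a closing </html> tag.
// Assumption: the target pages are full HTML documents.
function looksComplete(string $html): bool
{
    return stripos($html, '</html>') !== false;
}

// Call $fetch (any callable returning the body as a string, or false on failure)
// up to $maxAttempts times and return the first body that looks complete.
// Returns null if every attempt fails or looks truncated.
function fetchWithRetry(callable $fetch, int $maxAttempts = 3): ?string
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $body = $fetch();
        if (is_string($body) && looksComplete($body)) {
            return $body;
        }
    }
    return null;
}
```

In the question's code, the callable would wrap the existing curl_init() through curl_close() sequence and return $pageResponse. Retrying hides the symptom without explaining it, so it is best kept alongside the logging above.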