尝试编写一个简单的爬虫方法。当我使用PHP curl获取 www.yahoo.com 页面时,我什么也没拿。我该如何获取页面? 我的代码如下。
public function getWebPage($url, $timeout = 120) {
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_ENCODING => "",
CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.19) Gecko/20081216 Ubuntu/8.04 (hardy) Firefox/2.0.0.19",
CURLOPT_AUTOREFERER => true,
CURLOPT_CONNECTTIMEOUT => $timeout,
CURLOPT_TIMEOUT => $timeout,
CURLOPT_MAXREDIRS => 10,
);
$ch = curl_init($url);
curl_setopt_array($ch, $options);
$content = curl_exec($ch);
$err = curl_errno($ch);
$errmsg = curl_error($ch);
$header = curl_getinfo($ch);
curl_close($ch);
return $content;
}
答案 0 :(得分:1)
yahoo.com在安全套接字层上运行。因此,请将此cURL
参数添加到现有集合中。
CURLOPT_SSL_VERIFYPEER => false,
并禁用USERAGENT
..
<?php
class A
{
public function getWebPage($url, $timeout = 120) {
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_ENCODING => "",
//CURLOPT_USERAGENT => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.19) Gecko/20081216 Ubuntu/8.04 (hardy) Firefox/2.0.0.19",
CURLOPT_AUTOREFERER => true,
CURLOPT_CONNECTTIMEOUT => $timeout,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_TIMEOUT => $timeout,
CURLOPT_MAXREDIRS => 10,
);
$ch = curl_init($url);
curl_setopt_array($ch, $options);
$content = curl_exec($ch);
$err = curl_errno($ch);
$errmsg = curl_error($ch);
$header = curl_getinfo($ch);
curl_close($ch);
return $content;
}
}
$a = new A;
echo $a->getWebPage('www.yahoo.com');