Question

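I am trying to scrape some data from a URL with the help of Simple HTML DOM, but when I start my crawler it throws this error:

**failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found**

I also tried cURL, but it returns a 404 error as well.

Here is my PHP Simple HTML DOM code: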

function getURLContent($url)
{
    $html = new simple_html_dom();
    $html->load_file($url);
    /* I perform some operations here */
}

And the cURL version:

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
$data = curl_exec($curl);
echo $data; 
curl_close($curl);

How can I get this to work?

Thanks in advance.

Answer 1

Yes, try setting a user agent:

 curl_setopt($curl,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
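The same idea can be applied on the Simple HTML DOM side, since the error in the question comes from PHP's HTTP stream wrapper. Below is a minimal sketch, assuming the site rejects requests that lack a browser-like User-Agent (which may or may not be the actual cause of the 404). The helper names `buildHttpContextOptions`, `parseStatusCode`, and `fetchWithUserAgent` are made up for illustration and are not part of any library.

```php
<?php
// Hypothetical helper: stream-context options carrying a browser-like User-Agent.
function buildHttpContextOptions(string $userAgent, int $timeoutSeconds = 10): array
{
    return [
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $userAgent,
            'timeout'       => $timeoutSeconds,
            'ignore_errors' => true, // return the body on 4xx/5xx instead of failing
        ],
    ];
}

// Hypothetical helper: pulls the numeric code out of a status line
// such as "HTTP/1.1 404 Not Found"; returns 0 if the line does not match.
function parseStatusCode(string $statusLine): int
{
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m) ? (int) $m[1] : 0;
}

// Fetches $url with the given User-Agent and reports the final status code.
function fetchWithUserAgent(string $url, string $userAgent): array
{
    $context = stream_context_create(buildHttpContextOptions($userAgent));
    $body = @file_get_contents($url, false, $context);

    // The HTTP wrapper fills $http_response_header in the local scope.
    // After redirects it holds several status lines, so keep the last one.
    $status = 0;
    if (isset($http_response_header)) {
        foreach ($http_response_header as $line) {
            if (strpos($line, 'HTTP/') === 0) {
                $status = parseStatusCode($line);
            }
        }
    }

    return ['status' => $status, 'body' => $body === false ? '' : $body];
}

// Usage (the URL is a placeholder):
// $result = fetchWithUserAgent('http://example.com/', 'Mozilla/5.0 (Windows NT 10.0; rv:115.0) Gecko/20100101 Firefox/115.0');
// if ($result['status'] === 200) {
//     $html = str_get_html($result['body']); // Simple HTML DOM, if it is loaded
// }
```

Checking the status code before parsing makes it easier to tell a genuine 404 apart from a page the server refuses to serve to scripts.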

Answer 2

Add these to your code and try again:

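// $ch is your cURL handle (the question calls it $curl).
$headers = array('Accept: text/html,application/xhtml+xml', 'Accept-Language: en-US,en;q=0.5'); // example request headers, adjust as needed
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
curl_setopt($ch, CURLOPT_HEADER, false); // boolean option: true would include the response headers in the output
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers); // set request headers
curl_setopt($ch, CURLOPT_AUTOREFERER, true); // set the Referer automatically when following redirects
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // disables certificate verification for https URLs; keep it true outside of testing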

Answer 3

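A 404 error is unambiguous: the page was not found. Use Fiddler to capture the request a real browser sends, note the parameters it requires, and pass the same parameters through cURL in your script.

If you receive a block or error page instead, try changing the user agent, using a proxy address (free proxies are easy to find online), or maintaining a session while requesting the page. Fiddler will help you with this.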
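The suggestions above can be combined in one cURL call. This is only a sketch under stated assumptions: the header values, cookie file path, and proxy address are placeholders to be replaced with whatever a captured browser request actually shows, and the helper names `buildSessionCurlOptions` and `fetchWithSession` are hypothetical.

```php
<?php
// Hypothetical helper: cURL options that mimic a browser request and keep
// cookies between calls. All header values here are placeholders.
function buildSessionCurlOptions(string $url, string $cookieFile, ?string $proxy = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_AUTOREFERER    => true,  // set Referer on redirects
        CURLOPT_TIMEOUT        => 15,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: en-US,en;q=0.5',
        ],
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here on later requests
    ];

    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy; // e.g. 'http://127.0.0.1:8888' (placeholder)
    }

    return $options;
}

// Performs the request and returns the status code, body, and any cURL error text.
function fetchWithSession(string $url, string $cookieFile, ?string $proxy = null): array
{
    $ch = curl_init();
    curl_setopt_array($ch, buildSessionCurlOptions($url, $cookieFile, $proxy));

    $body   = curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error  = curl_error($ch);

    return [
        'status' => $status,
        'body'   => $body === false ? '' : $body,
        'error'  => $error,
    ];
}

// Usage (URL and cookie path are placeholders):
// $result = fetchWithSession('http://example.com/', sys_get_temp_dir() . '/cookies.txt');
// echo $result['status'] === 200 ? $result['body'] : 'Request failed: ' . $result['status'] . ' ' . $result['error'];
```

Reusing the same cookie file across calls is what keeps the session alive. If the page still returns 404 with matching headers and cookies, the URL itself is likely wrong.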

Web scraping using PHP
