PHP experts,
This script works:
<?php
include('simple_html_dom.php');
$html = file_get_html('http://google.com');

// to fetch all hyperlinks from a webpage
$links = array();
foreach ($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";

// to fetch all images from a webpage
$images = array();
foreach ($html->find('img') as $img) {
    $images[] = $img->src;
}
print_r($images);
echo "<br />";

// to find h1 headers from a webpage
$headlines = array();
foreach ($html->find('h1') as $header) {
    $headlines[] = $header->plaintext;
}
print_r($headlines);
echo "<br />";
?>
There I get no error about "find" being unrecognized. But why does my modified version below produce an error?
<?php
/* FINDING HTML ELEMENTS BASED ON THEIR TAG NAMES
Suppose you wanted to find each and every link on a webpage.
We will be using the "find" function to extract this information from the
object. Here's how to do it using Simple HTML DOM Parser:
*/
include('simple_html_dom.php');
$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$html = curl_exec($curl);

// to fetch all hyperlinks from a webpage
$links = array();
foreach ($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
?>
I get the error: Fatal error: Uncaught Error: Call to a member function find() on string in C:\xampp\htdocs\cURL\crawler.php:24 Stack trace: #0 {main} thrown in C:\xampp\htdocs\cURL\crawler.php on line 24
Strange! Why doesn't "find" throw the same error in the first, working script? Very odd! The two scripts are almost identical. In my modified version, I only replaced "$html = file_get_html('');" with cURL. See for yourself.
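The mismatch can be reproduced without any network access or the library itself, because it is purely about types: curl_exec() with CURLOPT_RETURNTRANSFER returns the response body as a plain PHP string, while file_get_html() returns a simple_html_dom object that actually has a find() method. A minimal sketch (the sample HTML string here is made up for illustration):

```php
<?php
// What curl_exec() hands back is just text — a plain PHP string:
$html = '<html><body><a href="/x">link</a></body></html>';

var_dump(is_string($html)); // true: it is a string, not a parser object
var_dump(is_object($html)); // false

// So this is exactly the line that triggers the fatal error in the question:
// $html->find('a'); // Fatal error: Call to a member function find() on string
//
// file_get_html(), by contrast, downloads AND parses, returning an object
// on which ->find() is a valid method call.
```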
The simple_html_dom.php file can be downloaded from here: https://sourceforge.net/projects/simplehtmldom/files/ I put that DOM file in the same directory as the script file. In other words, I just replaced:
//$html = file_get_html('http://nimishprabhu.com');
with:
$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$html = curl_exec($curl);
That's it!
First edit: u_mulder's code works on some URLs, but not on Yahoo. Why is that?
$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);
// to fetch all hyperlinks from a webpage
$links = array();
foreach ($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
Answer 0 (score 0):
Because file_get_html is a special function from the simple_html_dom library. If you open the source code of simple_html_dom, you will see that file_get_html() does a lot that your cURL replacement does not: most importantly, it parses the downloaded HTML string into a simple_html_dom object, whereas curl_exec() only returns a plain string, which has no find() method. That is why you get the error. A possible solution using str_get_html:
$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl);
$html = str_get_html($response_string);
// to fetch all hyperlinks from a webpage
$links = array();
foreach ($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";
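Regarding the first edit (the code working on some URLs but not on Yahoo): the answer's code has no error handling, so any failure surfaces as the same fatal error. Two plausible causes, offered as assumptions rather than a confirmed diagnosis: some sites reject requests that lack a browser-like User-Agent or send compressed bodies, and str_get_html() returns false (not an object) for an empty response or for input exceeding the library's MAX_FILE_SIZE limit, after which calling find() on false fatals. A more defensive sketch:

```php
<?php
include('simple_html_dom.php');

// Defensive variant of the answer's code (a sketch, not a guaranteed fix).
$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
// Some servers behave differently without a browser-like User-Agent:
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyCrawler/1.0)');
// Accept gzip/deflate and let cURL decode it transparently:
curl_setopt($curl, CURLOPT_ENCODING, '');

$response_string = curl_exec($curl);
if ($response_string === false) {
    die('cURL error: ' . curl_error($curl));
}
curl_close($curl);

$html = str_get_html($response_string);
if ($html === false) {
    // str_get_html() returns false for empty input or oversized pages
    die('str_get_html() could not parse the response (empty or too large?)');
}

$links = array();
foreach ($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
```

Checking both return values before calling find() at least turns the opaque fatal error into a message that says which step failed.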