Why does the member function find() fail in one script but not the other?

Asked: 2018-05-20 18:55:53

Tags: php oop dom web-crawler

PHP experts,

This script works:

<?php

include('simple_html_dom.php');

$html = file_get_html('http://google.com');

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";

//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
    $images[] = $img->src;
}
print_r($images);
echo "<br />";

//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
    $headlines[] = $header->plaintext;
}
print_r($headlines);
echo "<br />";

?>

There I get no error saying "find" is unrecognized. But why does my modification below produce an error?

<?php  

/* FINDING HTML ELEMENTS BASED ON THEIR TAG NAMES 

Suppose you wanted to find each and every link on a webpage.  
We will be using “find” function to extract this information from the 
object. Here’s how to do it using Simple HTML DOM Parser : 
*/ 

include('simple_html_dom.php'); 

$url = 'https://www.yahoo.com'; 
$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$html = curl_exec($curl); 

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";

?>

I get the error: Fatal error: Uncaught Error: Call to a member function find() on string in C:\xampp\htdocs\cURL\crawler.php:24 Stack trace: #0 {main} thrown in C:\xampp\htdocs\cURL\crawler.php on line 24

Strange! Why don't I get the same error on "find" in the first script, the one that works? Very odd! The two scripts are almost identical. In my modified version I just replaced "$html = file_get_html('');" with cURL. See for yourself.
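For what it's worth, here is a minimal stand-alone sketch (example.com is just a placeholder URL) of what curl_exec() hands back when CURLOPT_RETURNTRANSFER is enabled:

<?php
// Minimal sketch: inspect the return value of curl_exec()
// when CURLOPT_RETURNTRANSFER is enabled. Placeholder URL.
$curl = curl_init('https://www.example.com');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($curl);
curl_close($curl);

var_dump(is_object($html)); // bool(false), it is not an object
var_dump(gettype($html));   // "string" on success, "boolean" if the request failed
?>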

The simple_html_dom.php file can be downloaded from here: https://sourceforge.net/projects/simplehtmldom/files/ I put this DOM file in the same directory as the script file. In other words, I just replaced:

//$html = file_get_html('http://nimishprabhu.com');

with:

$url = 'https://www.yahoo.com'; 
$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$html = curl_exec($curl); 

That's it!

First edit: u_mulder's code works on some URLs but not on Yahoo. Why is that?

$url = 'https://www.yahoo.com'; 
$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$response_string = curl_exec($curl); 

$html = str_get_html($response_string);

//to fetch all hyperlinks from a webpage 
$links = array(); 
foreach($html->find('a') as $a) { 
    $links[] = $a->href; 
} 
print_r($links); 
echo "<br />"; 
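A debugging sketch (under assumptions, not a confirmed diagnosis) may narrow it down: either curl_exec() is failing for Yahoo, or str_get_html() is returning false instead of a DOM object; in some versions of the library it refuses input larger than its MAX_FILE_SIZE limit, and Yahoo's front page is large. Checking each step separately:

<?php
include('simple_html_dom.php');

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
// Assumption: some sites respond differently without a browser-like User-Agent.
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0');
$response_string = curl_exec($curl);

if ($response_string === false) {
    die('cURL error: ' . curl_error($curl)); // the request itself failed
}
curl_close($curl);

echo 'Fetched ' . strlen($response_string) . ' bytes<br />';

$html = str_get_html($response_string);
if ($html === false) {
    // str_get_html() returns false when it will not parse the input,
    // e.g. (in some versions) when the page exceeds MAX_FILE_SIZE.
    die('str_get_html() could not parse the response.');
}

$links = array();
foreach ($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
?>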

1 answer:

Answer 0 (score: 0)

Because file_get_html() is a special function from the simple_html_dom library. If you open the source code of simple_html_dom, you will see that file_get_html() does a lot more than your cURL replacement: it fetches the page and parses it into a DOM object. curl_exec() just returns the page as a plain string, and a string has no find() method. That's why you get the error.
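Conceptually (a simplified sketch, not the library's literal code), file_get_html() bundles the download and the parse into one step:

<?php
include('simple_html_dom.php');

// Simplified sketch of what file_get_html() does for you;
// the real function takes extra parameters and does more bookkeeping.
function file_get_html_sketch($url) {
    $raw = file_get_contents($url); // fetch the page as a plain string
    return str_get_html($raw);      // parse the string into a DOM object that has find()
}
?>

Replacing file_get_html() with a bare curl_exec() keeps the download but drops the parsing step.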

A possible solution with str_get_html():

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl); // a plain string, not a DOM object

// parse the string into a simple_html_dom object, which does have find()
$html = str_get_html($response_string);

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";