Why does the member function find() fail in one script but not the other?

Asked: 2018-05-20 18:55:53

Tags: php oop dom web-crawler

PHP experts,

This script works:

<?php

include('simple_html_dom.php');

$html = file_get_html('http://google.com');

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";

//to fetch all images from a webpage
$images = array();
foreach($html->find('img') as $img) {
    $images[] = $img->src;
}
print_r($images);
echo "<br />";

//to find h1 headers from a webpage
$headlines = array();
foreach($html->find('h1') as $header) {
    $headlines[] = $header->plaintext;
}
print_r($headlines);
echo "<br />";

?>

There I get no error saying "find" is unrecognized. But why does my modification below produce an error?

<?php  

/* FINDING HTML ELEMENTS BASED ON THEIR TAG NAMES 

Suppose you wanted to find each and every link on a webpage.  
We will be using “find” function to extract this information from the 
object. Here’s how to do it using Simple HTML DOM Parser : 
*/ 

include('simple_html_dom.php'); 

$url = 'https://www.yahoo.com'; 
$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$html = curl_exec($curl); 

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";

?>

I get the error: Fatal error: Uncaught Error: Call to a member function find() on string in C:\xampp\htdocs\cURL\crawler.php:24 Stack trace: #0 {main} thrown in C:\xampp\htdocs\cURL\crawler.php on line 24

Strange! Why don't I get the same error on "find" in the first script, the one that works? Very odd! The two scripts are almost identical. In my modified version I just replaced "$html = file_get_html('');" with cURL. See for yourself.
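For what it's worth, here is a minimal stand-alone sketch (example.com is just a placeholder URL) of what curl_exec() hands back when CURLOPT_RETURNTRANSFER is enabled:

<?php
// Minimal sketch: inspect the return value of curl_exec()
// when CURLOPT_RETURNTRANSFER is enabled. Placeholder URL.
$curl = curl_init('https://www.example.com');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($curl);
curl_close($curl);

var_dump(is_object($html)); // bool(false), it is not an object
var_dump(gettype($html));   // "string" on success, "boolean" if the request failed
?>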

The simple_html_dom.php file can be downloaded from here: https://sourceforge.net/projects/simplehtmldom/files/ I put this DOM file in the same directory as the script file. In other words, I just replaced:

//$html = file_get_html('http://nimishprabhu.com');

with:

$url = 'https://www.yahoo.com'; 
$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$html = curl_exec($curl); 

That's it!

First edit: u_mulder's code works on some URLs but not on Yahoo. Why is that?

$url = 'https://www.yahoo.com'; 
$curl = curl_init($url); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); 
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); 
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0); 
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0); 
$response_string = curl_exec($curl); 

$html = str_get_html($response_string);

//to fetch all hyperlinks from a webpage 
$links = array(); 
foreach($html->find('a') as $a) { 
    $links[] = $a->href; 
} 
print_r($links); 
echo "<br />"; 
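A debugging sketch (under assumptions, not a confirmed diagnosis) may narrow it down: either curl_exec() is failing for Yahoo, or str_get_html() is returning false instead of a DOM object; in some versions of the library it refuses input larger than its MAX_FILE_SIZE limit, and Yahoo's front page is large. Checking each step separately:

<?php
include('simple_html_dom.php');

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
// Assumption: some sites respond differently without a browser-like User-Agent.
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0');
$response_string = curl_exec($curl);

if ($response_string === false) {
    die('cURL error: ' . curl_error($curl)); // the request itself failed
}
curl_close($curl);

echo 'Fetched ' . strlen($response_string) . ' bytes<br />';

$html = str_get_html($response_string);
if ($html === false) {
    // str_get_html() returns false when it will not parse the input,
    // e.g. (in some versions) when the page exceeds MAX_FILE_SIZE.
    die('str_get_html() could not parse the response.');
}

$links = array();
foreach ($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
?>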

1 answer:

Answer 0 (score: 0)

Because file_get_html() is a special function from the simple_html_dom library. If you open the source code of simple_html_dom, you will see that file_get_html() does a lot more than your cURL replacement: it fetches the page and parses it into a DOM object. curl_exec() just returns the page as a plain string, and a string has no find() method. That's why you get the error.
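Conceptually (a simplified sketch, not the library's literal code), file_get_html() bundles the download and the parse into one step:

<?php
include('simple_html_dom.php');

// Simplified sketch of what file_get_html() does for you;
// the real function takes extra parameters and does more bookkeeping.
function file_get_html_sketch($url) {
    $raw = file_get_contents($url); // fetch the page as a plain string
    return str_get_html($raw);      // parse the string into a DOM object that has find()
}
?>

Replacing file_get_html() with a bare curl_exec() keeps the download but drops the parsing step.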

A possible solution with str_get_html():

$url = 'https://www.yahoo.com';
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
$response_string = curl_exec($curl); // a plain string, not a DOM object

// parse the string into a simple_html_dom object, which does have find()
$html = str_get_html($response_string);

//to fetch all hyperlinks from a webpage
$links = array();
foreach($html->find('a') as $a) {
    $links[] = $a->href;
}
print_r($links);
echo "<br />";