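I can't get my scraper to return the specific content I'm looking for. If I return $output, I see digg as though it were hosted on my own server, so I know I'm reaching the site correctly; I just can't access the elements in the DOM. What am I doing wrong?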
<?php
include('simple_html_dom.php');
function curl_download($url) {
$ch = curl_init(); //creates a new cURL resource handle
curl_setopt($ch, CURLOPT_URL, "http://digg.com"); // Set URL to download
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // TRUE to return the transfer as a string of the return value of curl_exec() instead of outputting it out directly.
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0"); // Set the User-Agent header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true ); // Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_HEADER, 0); // Include header in result? (0 = no, 1 = yes)
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // Timeout in seconds
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
}
$html = new simple_html_dom();
$html->load($output, true, false );
foreach($html->find('div.digg-story__kicker') as $article) {
$article_title = $article->find('.digg-story__kicker')->innertext;
return $article_title;
}
echo $article_title;
?>
Edit: OK, silly mistake. I'm now calling the function:
$html = curl_download('http://digg.com')
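If I echo $html I see the "mirrored site", but when I use str_get_html($html), which simple_html_dom.php documents as "// get html dom from string", I keep getting this error message:
Fatal error: Call to a member function str_get_html() on null in /home/andrew73124/public_html/scraper/scraper.php on line 31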
Answer 0 (score: 0)
Your loop is odd: you are already looping over the titles, so just access the innertext property of each one:
foreach($html->find('div.digg-story__kicker') as $article) {
echo $article->innertext;
}
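Answer 1 (score: 0)
The curl function needs one additional setting, namely CURLOPT_FOLLOWLOCATION, and the function itself needs to return a value before that value can be used. In the code below I return an object containing the response and the info, which lets you test the http_code before trying to process the response data.
This uses the standard DOMDocument, but it is no doubt easy to do with simple_dom as well.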
function curl_download( $url ) {
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );/* NEW */
curl_setopt( $ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0" );
curl_setopt( $ch, CURLOPT_HEADER, 0 );
curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
return (object)array(
'response' => $output,
'info' => $info
);
}
$output = curl_download( 'http://www.digg.com' );
if( $output->info['http_code']==200 ){
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->substituteEntities=true;
$dom->recover=true;
$dom->formatOutput=false;
$dom->loadHTML( $output->response );
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$col=$xp->query('//div[@class="digg-story__kicker"]');
if( $col && $col->length > 0 ){ /* query() returns false on error; a DOMNodeList object is never empty() */
foreach( $col as $node )echo $node->nodeValue;
}
} else {
echo '<pre>',print_r($output->info,true),'</pre>';
}
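Updated the answer to include the error-mitigation code provided by libxml, although the code ran locally without problems before the libxml error handling was added.
Without the CURLOPT_FOLLOWLOCATION setting I get: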
Array
(
[url] => http://www.digg.com
[content_type] => text/html
[http_code] => 301
[header_size] => 191
[request_size] => 79
[filetime] => -1
[ssl_verify_result] => 0
[redirect_count] => 0
[total_time] => 0.421
[namelookup_time] => 0.031
[connect_time] => 0.234
[pretransfer_time] => 0.234
[size_upload] => 0
[size_download] => 185
[speed_download] => 439
[speed_upload] => 0
[download_content_length] => 185
[upload_content_length] => 0
[starttransfer_time] => 0.421
[redirect_time] => 0
[certinfo] => Array
(
)
)
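The dump above reports http_code 301 with redirect_count 0, meaning the server answered with a redirect that cURL did not follow. As a rough sketch of the "test the http_code first" idea, the check could be wrapped in a small helper. The name response_status and its return labels are hypothetical, not part of cURL or of the answer's code.

```php
<?php
// Hypothetical helper: classify a curl_getinfo() array before parsing the body.
function response_status(array $info): string {
    $code = isset($info['http_code']) ? (int)$info['http_code'] : 0;
    if ($code === 200) {
        return 'ok';            // body is worth parsing
    }
    if ($code >= 300 && $code < 400) {
        return 'redirect';      // a 3xx was the final response, so no page body
    }
    return 'error';             // anything else, including a missing code
}

// Values copied from the dump above.
$info = ['url' => 'http://www.digg.com', 'http_code' => 301, 'redirect_count' => 0];
echo response_status($info), "\n";
```

A 'redirect' result here is a hint to enable CURLOPT_FOLLOWLOCATION rather than to parse the short 3xx body.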
But with CURLOPT_FOLLOWLOCATION set to true I get:
WE'VE SEEN BETTER ANIME TRIBUTE VIDEOS...<more>...RESIST THE URGE TO SUBTWEET A BAD APPLE
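One caveat about the XPath used in this answer: @class="digg-story__kicker" only matches when the class attribute is exactly that string, so an element carrying additional classes would be skipped. Below is a minimal sketch of a token-style match, run against made-up sample markup (an assumption, not digg.com's actual HTML) so it works without a network request.

```php
<?php
// Stand-in markup (an assumption, not digg.com's real HTML).
$sample = '<div class="digg-story__kicker">First kicker</div>'
        . '<div class="digg-story__kicker featured">Second kicker</div>'
        . '<div class="other">Not a kicker</div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($sample);
libxml_clear_errors();

$xp = new DOMXPath($dom);

// Exact attribute comparison: misses elements that carry extra classes.
$exact = $xp->query('//div[@class="digg-story__kicker"]');

// Token match: pad the class list with spaces and look for the padded name.
$token = $xp->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " digg-story__kicker ")]'
);

$kickers = [];
foreach ($token as $node) {
    $kickers[] = trim($node->nodeValue);
}

echo $exact->length, ' exact match(es), ', $token->length, " token match(es)\n";
echo implode("\n", $kickers), "\n";
```

Whether the extra robustness matters depends on how the target page writes its class attributes, so it is worth checking the fetched HTML before choosing between the two queries.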