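Hopefully this has a very simple solution. I'm new to PHP, so I may be missing something obvious. I'm building a scraper with ScraperWiki (though this is a PHP question, unrelated to SW). The code is below: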
<?php
require 'scraperwiki/simple_html_dom.php';
$allLinks = array();
function nextPage($nextUrl, $y)
{
    getLinks($nextUrl, $y);
}
function getLinks($url) // gets links from product list page
{
    global $allLinks;
    $html_content = scraperwiki::scrape($url);
    $html = str_get_html($html_content);
    if (isset($y)) {
        $x = $y;
    } else {
        $x = 0;
    }
    foreach ($html->find("div.views-row a.imagecache-product_list") as $el) {
        $url = $el->href . "\n";
        $allLinks[$x] = 'http://www.foo.com';
        $allLinks[$x] .= $url;
        $x++;
    }
    $next = $html->find("li.pager-next a", 0)->href . "\n";
    print_r("Printing $next:");
    print_r($next);
    if (isset($next)) {
        $nextUrl = 'http://www.foo.com';
        $nextUrl .= $next;
        print_r($nextUrl);
        $y = $x;
        print_r("Printing X:");
        print_r($x);
        print_r("Printing Y:");
        print_r($y);
        nextPage($nextUrl, $y);
    } else {
        return;
    }
}
getLinks("http://www.foo.com/department/accessories");
print_r($allLinks);
?>
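EXPECTED OUTPUT: The script should scrape all links from the first page, find the "next page" button, scrape the links from its URL, find the "next page" link from that URL, and so on. It should stop when there are no "next page" links left.
CURRENT OUTPUT: The code runs fine, but it does not stop when it should. This is the key line: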
$next = $html->find("li.pager-next a", 0)->href . "\n";
if (isset($next)) { }
I only want to run the nextPage() function if li.pager-next a exists on the page. Here is the console output:
http://www.foo.com/department/accessories?page=1
http://www.foo.com/department/accessories?page=2
http://www.foo.com/department/accessories?page=3
http://www.foo.com/department/accessories?page=4
http://www.foo.com/department/accessories?page=5
http://www.foo.com/department/accessories?page=6
http://www.foo.com/department/accessories?page=7
http://www.foo.com/department/accessories?page=8
http://www.foo.com/department/accessories?page=9
http://www.foo.com/department/accessories?page=10
PHP Notice: Trying to get property of non-object in /home/scriptrunner/script.php on line 31
// THE LOOP SHOULD BREAK HERE BUT DOESN'T
http://www.foo.com
http://www.foo.com/home?page=1
http://www.foo.com/home?page=2
http://www.foo.com/home?page=3
http://www.foo.com/home?page=4
http://www.foo.com/home?page=5
http://www.foo.com/home?page=6
http://www.foo.com/home?page=7
Answer 0 (score: 1)
How about this:
$next = $html->find("li.pager-next a", 0);
if (isset($next)) {
    $nextUrl = 'http://www.foo.com';
    $nextUrl .= $next->href; // move ->href here
    print_r($nextUrl . "\n"); // put \n here since we don't actually want that char in the url
    $y = $x;
    print_r("Printing X:");
    print_r($x);
    print_r("Printing Y:");
    print_r($y);
    nextPage($nextUrl, $y);
} else {
    return;
}
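If recursion is not a requirement, the same stop condition can also be written as a loop. The following is only a minimal sketch: collectLinks() and the $fetchPage callable are hypothetical names that stand in for the scrape-and-parse step, and the stubbed page data is made up so the example runs without ScraperWiki or simple_html_dom. In a real scraper, $fetchPage would presumably wrap scraperwiki::scrape() and str_get_html(), and return null for 'next' when find("li.pager-next a", 0) matches nothing.

```php
<?php
// Hypothetical helper (not part of ScraperWiki or simple_html_dom).
// $fetchPage must return ['links' => string[], 'next' => string|null].
function collectLinks(string $startUrl, callable $fetchPage): array
{
    $allLinks = [];
    $url = $startUrl;
    while ($url !== null) {
        $page = $fetchPage($url);
        foreach ($page['links'] as $link) {
            $allLinks[] = $link; // append; no manual index bookkeeping
        }
        $url = $page['next'];    // null on the last page ends the loop
    }
    return $allLinks;
}

// Stubbed pages in place of real HTTP requests (made-up data).
$pages = [
    '/list'        => ['links' => ['/a', '/b'], 'next' => '/list?page=1'],
    '/list?page=1' => ['links' => ['/c'],       'next' => null],
];
$fetchStub = function (string $url) use ($pages): array {
    return $pages[$url];
};

$collected = collectLinks('/list', $fetchStub);
print_r($collected);
```

Appending with $allLinks[] removes the need to pass the $x/$y counters between calls, and returning the array avoids the global.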
Answer 1 (score: 0)
What value does
$next = $html->find("li.pager-next a", 0)->href . "\n";
return? Because "\n" is appended to it, $next is always a string, so isset($next) will never return false.
Use something like:
$nextElement = $html->find("li.pager-next a", 0);
if (isset($nextElement))
{
    $nextUrl = 'http://www.foo.com' . $nextElement->href; // keep the newline out of the URL
    print_r($nextUrl . PHP_EOL);
    $y = $x;
    print_r("Printing X:");
    print_r($x);
    print_r("Printing Y:");
    print_r($y);
    nextPage($nextUrl, $y);
}
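To see why the original check never fails, here is a minimal standalone sketch. It assumes that find() with an index yields null when nothing matches, and uses a plain null variable ($missing, a made-up name) in place of the real call so it runs without simple_html_dom.

```php
<?php
// Stand-in for a find() call that matched nothing (assumed to yield null).
$missing = null;

// Original pattern: read ->href and append "\n" in one expression.
// @ hides the notice/warning PHP raises for the property access on null.
$next = @$missing->href . "\n";
$originalCheck = isset($next);      // true: $next is the string "\n"

// Suggested pattern: keep the element and test it before touching ->href.
$nextElement = $missing;
$fixedCheck = isset($nextElement);  // false: null counts as not set

var_dump($originalCheck, $fixedCheck);
```

The property access on null evaluates to null, and concatenating "\n" turns that into a one-character string, which is why the loop keeps going with a URL of just http://www.foo.com.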
Answer 2 (score: -2)
Just remove the isset():
if($next){ }