PHP simple_html_dom: parsing links from paginated pages

Asked: 2014-04-21 09:32:14

Tags: php html parsing simple-html-dom scraper

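I modified the script below to collect every article link reachable from the URL set in $url.

It works to some extent: it picks up every page URL, but it does not parse every page. It parses only the first page and repeats those results for all the others.

Can someone tell me what I am doing wrong here? I have spent more than a day trying everything. I have also included the results I get.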

<?php
include('simple_html_dom.php');
$base = "http://singersroom.com";
$url = "http://singersroom.com/subcontent/rnb-news/";

// Start from the main page
$nextLink = $url;

// Loop over each next link as long as it exists
while ($nextLink) {
    echo "<hr>nextLink: $nextLink<br>";
    //Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a url
    $html->load_file($nextLink);
    $posts = $html->find('h3[class=prl-article-title]');
    foreach($posts as $post) {
        // Get the link
        $articles = $post->children(0)->href;        
        echo $base, $articles, '<br>';
    }
    // Extract the next link; if not found, set it to NULL
    //$nextLink = ( ($temp = $html->find('div[class=pagination]', 0)->last_child()) ? $temp->href : NULL );

    //$nextLink = ( ($temp = $html->find('div.pagination a[class="Next >>"]', 0)) ? "http://singersroom.com/subcontent/rnb-news/".$temp->href : NULL );
    $nextLink = ( ($temp = $html->find('div[class=pagination]', 0)->last_child()) ? "http://singersroom.com/subcontent/rnb-news/".$temp->href : NULL );

    //echo $temp;
    // Clear DOM object
    $html->clear();
    unset($html);
}

?>

Here are the results I get:

  

nextLink: hxxp://singersroom.com/subcontent/rnb-news/
hxxp://singersroom.com/content/2014-04-18/Prince-Collabs-with-Warner-Bros-for-New-Music-Purple-Rain-Anniversary-Album/
hxxp://singersroom.com/content/2014-04-17/[slug garbled in source]/
hxxp://singersroom.com/content/2014-04-14/Tamar-Braxton-Readies-New-Album-Inks-Third-Season-of-Tamar-Vince/
hxxp://singersroom.com/content/2014-04-14/Jennifer-Hudson-Walk-It-Out-Ft-Timbaland/
hxxp://singersroom.com/content/2014-04-15/Kindred-The-Family-Soul-Everybodys-Hustlin/
hxxp://singersroom.com/content/2014-04-15/Lyrica-Anderson-Freakin-ft-Wiz-Khalifa/
hxxp://singersroom.com/content/2014-04-07/Dont-Worry-About-Them-10-Baby-Mothers-That-Are-Doing-Just-Fine/
hxxp://singersroom.com/content/2014-03-27/[slug garbled in source]/
hxxp://singersroom.com/content/2014-04-16/The-Forbes-Five-2014s-Wealthiest-Artists-in-Hip-Hop/

nextLink: hxxp://singersroom.com/subcontent/rnb-news/?page=2
hxxp://singersroom.com/content/2014-04-18/Prince-Collabs-with-Warner-Bros-for-New-Music-Purple-Rain-Anniversary-Album/
hxxp://singersroom.com/content/2014-04-17/[slug garbled in source]/
hxxp://singersroom.com/content/2014-04-14/Tamar-Braxton-Readies-New-Album-Inks-Third-Season-of-Tamar-Vince/
hxxp://singersroom.com/content/2014-04-14/Jennifer-Hudson-Walk-It-Out-Ft-Timbaland/
hxxp://singersroom.com/content/2014-04-15/Kindred-The-Family-Soul-Everybodys-Hustlin/
hxxp://singersroom.com/content/2014-04-15/Lyrica-Anderson-Freakin-ft-Wiz-Khalifa/
hxxp://singersroom.com/content/2014-04-07/Dont-Worry-About-Them-10-Baby-Mothers-That-Are-Doing-Just-Fine/
hxxp://singersroom.com/content/2014-03-27/[slug garbled in source]/
hxxp://singersroom.com/content/2014-04-16/The-Forbes-Five-2014s-Wealthiest-Artists-in-Hip-Hop/

...

nextLink: hxxp://singersroom.com/subcontent/rnb-news/?page=96
hxxp://singersroom.com/content/2014-04-18/Prince-Collabs-with-Warner-Bros-for-New-Music-Purple-Rain-Anniversary-Album/
hxxp://singersroom.com/content/2014-04-17/[slug garbled in source]/
hxxp://singersroom.com/content/2014-04-14/Tamar-Braxton-Readies-New-Album-Inks-Third-Season-of-Tamar-Vince/
hxxp://singersroom.com/content/2014-04-14/Jennifer-Hudson-Walk-It-Out-Ft-Timbaland/
hxxp://singersroom.com/content/2014-04-15/Kindred-The-Family-Soul-Everybodys-Hustlin/
hxxp://singersroom.com/content/2014-04-15/Lyrica-Anderson-Freakin-ft-Wiz-Khalifa/
hxxp://singersroom.com/content/2014-04-07/Dont-Worry-About-Them-10-Baby-Mothers-That-Are-Doing-Just-Fine/
hxxp://singersroom.com/content/2014-03-27/[slug garbled in source]/
hxxp://singersroom.com/content/2014-04-16/The-Forbes-Five-2014s-Wealthiest-Artists-in-Hip-Hop/

1 answer:

Answer 0 (score: 0)

Your links all use hxxp, which means they are not valid links. Replace hxxp with http in your URLs and you should be able to get to the next step.
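If the URLs really do arrive with an hxxp scheme, as the output above suggests, one way to apply this suggestion is to rewrite the scheme before the URL is passed to load_file(). The helper below is only a sketch: normalizeScheme is a hypothetical name, not part of simple_html_dom, and it assumes nothing other than the leading scheme needs fixing.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): turn a leading
// "hxxp://" or "hxxps://" into "http://" or "https://".
function normalizeScheme(string $url): string
{
    // Only the scheme at the very start of the string is rewritten;
    // host, path and query string are left untouched.
    // preg_replace() returns null on a regex error, so fall back to the input.
    return preg_replace('#^hxxp(s?)://#i', 'http$1://', $url) ?? $url;
}

// Example with one of the page URLs from the output above.
echo normalizeScheme('hxxp://singersroom.com/subcontent/rnb-news/?page=2'), PHP_EOL;
```

In the loop from the question, this would be called on $nextLink just before $html->load_file($nextLink). URLs that already start with http:// or https:// pass through unchanged.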