Simple HTML DOM: nested pages, trying to capture the next one

Time: 2013-03-04 03:09:46

Tags: php simple-html-dom

How do I retrieve the last occurrence of a.page_arrows?

    <div class="page-nav">  
    <a class="paginationNumberStyle page_arrows" data-url="/Building-Materials-Concrete-Cement-Masonry/h_d1/N-5yc1vZ25ecodZarlk/h_d2/Navigation?catalogId=10053&amp;Nu=P_PARENT_ID&amp;langId=-1&amp;Nao=384&amp;storeId=10051"> 
    <img alt="" src="/static/images/layout/triangle-green-left.gif"></a>                          
    <span>6</span>
    <a class="paginationNumberStyle" data-url="/Building-Materials-Concrete-Cement-Masonry/h_d1/N-5yc1vZ25ecodZarlk/h_d2/Navigation?catalogId=10053&amp;Nu=P_PARENT_ID&amp;langId=-1&amp;Nao=576&amp;storeId=10051">7</a>
    <a class="paginationNumberStyle" data-url="/Building-Materials-Concrete-Cement-Masonry/h_d1/N-5yc1vZ25ecodZarlk/h_d2/Navigation?catalogId=10053&amp;Nu=P_PARENT_ID&amp;langId=-1&amp;Nao=672&amp;storeId=10051">8</a>
    <a class="paginationNumberStyle page_arrows" data-url="/Building-Materials-Concrete-Cement-Masonry/h_d1/N-5yc1vZ25ecodZarlk/h_d2/Navigation?catalogId=10053&amp;Nu=P_PARENT_ID&amp;langId=-1&amp;Nao=576&amp;storeId=10051"> 
    <img alt="" src="/static/images/layout/triangle-green-right.gif"></a>
    </div>

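I am trying to collect the links, then go to the next page and collect the rest of the links, until there are no more nested pages. Here is my code: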

    getLinks('http://www.homedepot.com/Building-Materials-Concrete-Cement-Masonry/h_d1/N-5yc1vZ25ecodZarlk/h_d2/Navigation?catalogId=10053&Nu=P_PARENT_ID&langId=-1&storeId=10051&currentPLP=true&omni=c_Concrete,%20Cement%20&%20Masonry&searchNav=true');

    function getLinks($URL) {
        $html = file_get_contents($URL);

        $dom = new simple_html_dom();
        $dom->load($html);

        foreach ($dom->find('a[class=item_description]') as $href) {
            $url = $href->href;
            echo $url . '<br>';
        }

        if ($nextPage = $dom->find("a[class=paginationNumberStyle]", 0)) {
            $nextPageURL = 'http://www.homedepot.com' . $nextPage->getAttribute('data-url');

            $dom->clear();
            unset($dom);
            getLinks($nextPageURL);
        } else {
            echo "\nEND";
            $dom->clear();
            unset($dom);
        }
    }

1 Answer:

Answer 0 (score: 1)

I ran into the same problem and used the children() method to grab the first-level items.

    <ul class="my-list">
        <li>
            <a href="#">Some Text</a>
            <ul>
                <li><a href="#">Some Inner Text</a></li>
                <li><a href="#">Some Inner Text</a></li>
                <li><a href="#">Some Inner Text</a></li>
                <li><a href="#">Some Inner Text</a></li>
            </ul>
        </li>
        <li>
            <a href="#">Some Text</a>
            <ul>
                <li><a href="#">Some Inner Text</a></li>
                <li><a href="#">Some Inner Text</a></li>
                <li><a href="#">Some Inner Text</a></li>
                <li><a href="#">Some Inner Text</a></li>
            </ul>
        </li>
    </ul>

Here is the Simple HTML DOM code that gets only the first-level li items:

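    $html = file_get_html( $url );

    // children() called without an index returns the direct child nodes,
    // here the first-level <li> items only
    $first_level_items = $html->find( '.my-list', 0 )->children();

    foreach ( $first_level_items as $item ) {
        // ... do stuff ...
    }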
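
For the question as asked (the last occurrence of a.page_arrows), a minimal sketch follows. It assumes that Simple HTML DOM's find() accepts a negative index that counts from the end of the match list, that simple_html_dom.php can be loaded from the current directory, and that on the final page the last arrow link is either missing or is the left arrow. The helper name findNextPageUrl is hypothetical, and the sample URLs are shortened copies of the ones in the question, used only for illustration.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory; adjust the path to your setup.
require_once 'simple_html_dom.php';

/**
 * Hypothetical helper: returns the absolute URL of the next page, or null
 * when the last arrow link is missing or is not the "right" arrow.
 */
function findNextPageUrl($dom, $baseUrl)
{
    // A negative index counts from the end, so -1 picks the last match.
    $arrow = $dom->find('a.page_arrows', -1);
    if ($arrow === null) {
        return null;
    }

    // Guard against following the "previous" arrow on the final page:
    // in the question's markup the next arrow uses triangle-green-right.gif.
    $img = $arrow->find('img', 0);
    if ($img === null || strpos((string) $img->src, 'right') === false) {
        return null;
    }

    // Decode &amp; in case the attribute value comes back undecoded.
    return $baseUrl . html_entity_decode($arrow->getAttribute('data-url'));
}

// Shortened copy of the pagination block from the question.
$markup = <<<HTML
<div class="page-nav">
<a class="paginationNumberStyle page_arrows" data-url="/list?Nao=384&amp;storeId=10051"><img alt="" src="left.gif"></a>
<span>6</span>
<a class="paginationNumberStyle" data-url="/list?Nao=576&amp;storeId=10051">7</a>
<a class="paginationNumberStyle page_arrows" data-url="/list?Nao=576&amp;storeId=10051"><img alt="" src="right.gif"></a>
</div>
HTML;

$dom  = str_get_html($markup);
$next = findNextPageUrl($dom, 'http://www.homedepot.com');
echo ($next === null ? 'END' : $next), "\n";
```

In getLinks() from the question, a helper like this could replace the find("a[class=paginationNumberStyle]", 0) call, which always returns the first pagination link (the left arrow in the markup shown) rather than the last one. Checking the image name is one possible way to stop at the last page; whether it matches the real site's final-page markup is untested here.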