How do I scrape a page using Simple HTML DOM and PHP?

Time: 2015-02-20 12:42:58

Tags: php html web-scraping simple-html-dom

I am trying to get the data in <div id="listing-page-cart-inner">, the text of <div id="description">, and <div id="tags">, but I am finding it hard to dig the data out.

Can anyone guide me? I am able to scrape the first div I mentioned, but not the others. Also, once I loop over the second foreach, the script takes much longer to run.

<?php
include_once('simple_html_dom.php');

$html = file_get_html('https://etsy.com/listing/107492702/');
//$val =  $html->find('div[id=listing-page-cart-inner]');


function scraping_etsy() {
    // create HTML DOM
    $html = file_get_html('https://etsy.com/listing/107492702/');

    foreach($html->find('div[id=listing-page-cart-inner]') as $article)
    {
        // get title
        //$item['title'] = trim($article->find('h3', 0)->plaintext);
        // get details
        $item['details'] = trim($article->find('span', 0)->plaintext);
        // get intro
        //$lists = $articles->find('div[id=item-overview]');

        $item['list1'] = trim($article->find('li',0)->plaintext);
        $item['list2'] = trim($article->find('li',1)->plaintext);
        $item['list3'] = trim($article->find('li',2)->plaintext);
        $item['list4'] = trim($article->find('li',3)->plaintext);
        $item['list5'] = trim($article->find('li',4)->plaintext);

        /*foreach($article->find('li') as $al){
            $item['lists'] =trim($al->find('li')->plaintext);

        }*/

        $ret[] = $item;

    }


    foreach($html->find('div[id=description]') as $content){
        var_dump($content->find('text'));
        // $item['content'] = trim($content->find('div[id=description]')->plaintext);
        // $ret[] = $item;
    }
    // clean up memory
    $html->clear();
    unset($html);

    return $ret ;
}
$ret = scraping_etsy();

var_dump($ret);

/*foreach($ret as $v) {
    echo $v['title'].'<br>';
    echo '<ul>';
    echo '<li>'.$v['details'].'</li>';
    echo '<li>Diggs: '.$v['diggs'].'</li>';
    echo '</ul>';
}*/
?>

2 answers:

Answer 0 (score: 1)

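As for getting the children of those divs: once you have found the parent element, always pass an index, as in ->find('<the selector here>', 0), so that the call actually points to that one element instead of returning an array of matches.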

include_once('simple_html_dom.php');

$html = file_get_html('https://etsy.com/listing/107492702/');
// listings with description
$div = $html->find('div#listing-page-cart-inner', 0); // here index zero
$main_description = $div->find('h1', 0)->innertext;
echo $main_description . '<br/><br/>';

$div_item_overview = $div->find('div#item-overview ul.properties li');
foreach ($div_item_overview as $overview) {
    echo $overview->innertext . '<br/>';
}

// tags
$div_tag = $html->find('div#tags', 0); // here index zero pointing to that element
$tags = array();
foreach($div_tag->find('ul li') as $li) {
    $tags[] = $li->find('a', 0)->innertext;
}

echo '<pre>', print_r($tags, 1), '</pre>';

// description
$div_description = $html->find('div#description', 0)->plaintext; // here pointing to index zero
echo $div_description;
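
One caveat with the snippet above: as far as I know, find() with an index returns null when nothing matches, and file_get_html() returns false when the page cannot be loaded, so chaining ->innertext or ->plaintext onto a miss ends in a fatal error. Below is a minimal sketch of a guard. The helper name first_plaintext is hypothetical and is not part of Simple HTML DOM; it only assumes that the object passed in has a find($selector, $index) method and that matches expose a plaintext property.

```php
<?php
// Hypothetical helper (NOT part of Simple HTML DOM): returns the trimmed
// plaintext of the first element matching $selector, or $default on a miss.
function first_plaintext($node, $selector, $default = null)
{
    // $node may be false (page failed to load) or null (parent not found).
    if (!$node) {
        return $default;
    }

    // Index 0 asks for a single element; a miss is assumed to yield null.
    $match = $node->find($selector, 0);

    return $match ? trim($match->plaintext) : $default;
}
```

It could be used as first_plaintext($html->find('div#listing-page-cart-inner', 0), 'h1', ''). Loading the page once and passing that same $html around also avoids the second download that the question's code performs.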

Answer 1 (score: 0)

The easiest way to get started is to use a third-party library such as Symfony DomCrawler.

It is simple to use:

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
    </body>
</html>
HTML;

$crawler = new Crawler($html);

foreach ($crawler as $domElement) {
    print $domElement->nodeName;
}

You can use filters such as:

$crawler = $crawler->filter('body > p');
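
If installing a Composer package is not an option, roughly the same body > p selection can be sketched with PHP's bundled DOM extension. Note that this swaps DomCrawler for DOMDocument and DOMXPath, and the XPath expression //body/p is my own translation of the CSS selector, so treat it as an illustration rather than a drop-in replacement.

```php
<?php
$html = <<<'HTML'
<!DOCTYPE html>
<html>
    <body>
        <p class="message">Hello World!</p>
        <p>Hello Crawler!</p>
    </body>
</html>
HTML;

// Collect parser warnings internally instead of printing them;
// real-world pages are rarely valid markup.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// "//body/p" selects <p> elements that are direct children of <body>,
// the XPath counterpart of the CSS selector "body > p".
$texts = array();
foreach ($xpath->query('//body/p') as $p) {
    $texts[] = trim($p->textContent);
}

print_r($texts);
```

For the sample markup this collects the text of both paragraphs. Against a live page the same loop applies once the downloaded HTML string is passed to loadHTML().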