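Need help with Simple HTML DOM in PHP

Time: 2011-03-14 18:56:47

Tags: php

I'm using PHP and Simple HTML DOM to do some screen scraping. I'm struggling to find any consistency in the markup I'm targeting. The divs are all oddly named. See the example...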

<!-- Page START -->
<h2>Small houses</h2>
<p id="imPathTitle">Dolls Houses</p>
<div id="imPage">

<div id="imCel1_02">
<div id="imCel1_02_Cont">
    <div id="imObj1_02">
<img src="images/daisylane.jpg" alt="" title="" />
    </div>
</div>
</div>

<div id="imCel1_00">
<div id="imCel1_00_Cont">
    <div id="imObj1_00">
<img src="images/1_h117.jpg" alt="" title="" />
    </div>
</div>
</div>

<div id="imCel0_00">
<div id="imCel0_00_Cont">
    <div id="imObj0_00">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc2 fs10 ">

<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door,<br />decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /><br /></span><span class="ff2 fc4 fs10 fb ">W440mm D350mm H425mm</span><span class="ff2 fc2 fs10 ">
<br /></span></p>
    </div>
</div>
</div>

<div id="imCel1_01">
<div id="imCel1_01_Cont">
    <div id="imObj1_01">

<img src="images/2_h111.jpg" alt="" title="" />
    </div>
</div>
</div>

<div id="imCel0_01">
<div id="imCel0_01_Cont">
    <div id="imObj0_01">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H111: Lilys Cottage</span><span class="ff3 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door, decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /></span><span class="ff2 fc4 fs10 fb ">

<br />W440mm D350mm H425mm</span><span class="ff2 fc4 fs10 ">
<br /></span></p>
    </div>
</div>
</div>

<div id="imCel0_02">
<div id="imCel0_02_Cont">
    <div id="imObj0_02">
<p class="imAlign_left"><span class="ff2 fc3 fs10 "> le toy van, wodden toys, designed in uk, pirate, fantasy, everyday, historical, fairytale, dolls, manufactured in indonesia, traditional wooden toys, fabric clothing,<br />designed in the uk manufactured in indonesia, copyright le toy van ltd, manufacturer distributor, designer, dolls houses, castles, garages, cars, budkins, traditional wooden toys, fairies, farms<br /></span></p>
    </div>

</div>
</div>

</div>
<!-- Page END -->

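There are two products on this page, and they seem to be using divs like a table? Which elements can I target to get the "image", "title" and "description"? This is what I'm using at the moment...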

foreach($all_pages->find('img') as $src){

    if (strpos($src->src,"http://letoyvan.com") === false) {
        $src->src = "http://letoyvan.com/$src->src";
    }
       $product['image'][] = $src->src;
}

foreach($all_pages->find('p[class*=imAlign_left]') as $description){
       $product['description'][] =  $description->innertext;
}

foreach($all_pages->find('span[class*=fc3]') as $title){
       $product['title'][] =  $title->innertext;
}

1 answer:

Answer 0 (score: 2)

Simple HTML DOM is very memory-hungry; DOMDocument is a much better choice. Here is an example:

    $page = <<< HTML
    <html>
    <head>
    <title>Test DOMDocument</title>
    </head>
    <body>
    <!-- Page START -->
    <h2>Small houses</h2>
    <p id="imPathTitle">Dolls Houses</p>
    <div id="imPage">

    <div id="imCel1_02">
    <div id="imCel1_02_Cont">
        <div id="imObj1_02">
    <img src="images/daisylane.jpg" alt="" title="" />
        </div>
    </div>
    </div>

    <div id="imCel1_00">
    <div id="imCel1_00_Cont">
        <div id="imObj1_00">
    <img src="images/1_h117.jpg" alt="" title="" />
        </div>
    </div>
    </div>

    <div id="imCel0_00">
    <div id="imCel0_00_Cont">
        <div id="imObj0_00">
    <p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc2 fs10 ">

    <br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door,<br />decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /><br /></span><span class="ff2 fc4 fs10 fb ">W440mm D350mm H425mm</span><span class="ff2 fc2 fs10 ">
    <br /></span></p>
        </div>
    </div>
    </div>

    <div id="imCel1_01">
    <div id="imCel1_01_Cont">
        <div id="imObj1_01">

    <img src="images/2_h111.jpg" alt="" title="" />
        </div>
    </div>
    </div>

    <div id="imCel0_01">
    <div id="imCel0_01_Cont">
        <div id="imObj0_01">
    <p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H111: Lilys Cottage</span><span class="ff3 fc2 fs10 ">
    <br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door, decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /></span><span class="ff2 fc4 fs10 fb ">

    <br />W440mm D350mm H425mm</span><span class="ff2 fc4 fs10 ">
    <br /></span></p>
        </div>
    </div>
    </div>

    <div id="imCel0_02">
    <div id="imCel0_02_Cont">
        <div id="imObj0_02">
    <p class="imAlign_left"><span class="ff2 fc3 fs10 "> le toy van, wodden toys, designed in uk, pirate, fantasy, everyday, historical, fairytale, dolls, manufactured in indonesia, traditional wooden toys, fabric clothing,<br />designed in the uk manufactured in indonesia, copyright le toy van ltd, manufacturer distributor, designer, dolls houses, castles, garages, cars, budkins, traditional wooden toys, fairies, farms<br /></span></p>
        </div>

    </div>
    </div>

    </div>
    <!-- Page END -->
    </body>
    </html>
HTML;
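libxml_use_internal_errors(true); // suppress warnings about imperfect markup
$dom = new DOMDocument();
$dom->loadHTML($page); // loadHTML() parses a string; load() expects a file path
$product = array('image' => array(), 'description' => array(), 'title' => array());

foreach($dom->getElementsByTagName('img') as $img)
{
    $src = $img->getAttribute('src');
    if (strpos($src, "http://letoyvan.com") === false) {
        // Relative path: prefix it with the site root
        $src = "http://letoyvan.com/" . $src;
    }
    $product['image'][] = $src;
}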

foreach($dom->getElementsByTagName('p') as $para) 
{
    if ($para->hasAttributes()) 
    {
         if ($para->getAttribute('class') == "imAlign_left")
         {
             $product['description'][] =  $para->nodeValue;
         }
    }
}

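foreach($dom->getElementsByTagName('span') as $span) 
{
    // The class attribute holds several names ("ff2 fc3 fs12 fb "), so look for
    // the "fc3" token instead of comparing the whole string.
    // Note: the keyword block at the bottom of the page also uses fc3.
    $classes = preg_split('/\s+/', trim($span->getAttribute('class')));
    if (in_array('fc3', $classes))
    {
        $product['title'][] =  $span->nodeValue;
    }
}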
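The three loops above fill three separate lists, so nothing ties an image to its title. In the sample markup the image cells are named `imObj1_NN` and the text cells `imObj0_NN`, and the numeric suffix appears to line them up. The following is only a sketch under that assumption, using DOMXPath on a trimmed-down, hypothetical copy of the page; the `$products` layout is made up for illustration.

```php
<?php
// Hypothetical, shortened copy of the markup from the question.
$html = <<<HTML
<div id="imPage">
<div id="imObj1_00"><img src="images/1_h117.jpg" alt="" /></div>
<div id="imObj0_00"><p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc4 fs10 ">Pretty painted cottage</span></p></div>
<div id="imObj1_01"><img src="images/2_h111.jpg" alt="" /></div>
<div id="imObj0_01"><p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H111: Lilys Cottage</span><span class="ff2 fc4 fs10 ">Pretty painted cottage</span></p></div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$products = array();
// Image cells: every div whose id starts with "imObj1_".
foreach ($xpath->query('//div[starts-with(@id, "imObj1_")]') as $imageCell) {
    $suffix = substr($imageCell->getAttribute('id'), strlen('imObj1_'));
    $img    = $xpath->query('.//img', $imageCell)->item(0);

    // Assumption: the matching text cell is "imObj0_" plus the same suffix.
    $textCell = $xpath->query('//div[@id="imObj0_' . $suffix . '"]')->item(0);
    if ($img === null || $textCell === null) {
        continue; // no partner cell, skip it
    }

    // First span in the text cell whose class list contains the token "fc3".
    $titleSpan = $xpath->query(
        './/span[contains(concat(" ", normalize-space(@class), " "), " fc3 ")]',
        $textCell
    )->item(0);

    $products[] = array(
        'image'       => 'http://letoyvan.com/' . $img->getAttribute('src'),
        'title'       => $titleSpan ? trim($titleSpan->nodeValue) : '',
        'description' => trim($textCell->nodeValue), // full cell text, title included
    );
}

print_r($products);
```

On the full page, the `_02` pair would match `daisylane.jpg` with the keyword block, so that entry probably needs to be filtered out (for example by skipping cells whose title span is not `fs12`).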

If you need the description to keep its HTML, you can use this function:

 function DOMinnerHTML($element) 
    { 
        $innerHTML = ""; 
        $children = $element->childNodes; 
        foreach ($children as $child) 
        { 
            // Serialize each child on its own and append it, so the result
            // is the element's inner HTML without the element's own tag
            $tmp_dom = new DOMDocument(); 
            $tmp_dom->appendChild($tmp_dom->importNode($child, true)); 
            $innerHTML .= trim($tmp_dom->saveHTML()); 
        } 

        return $innerHTML;
    }
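On PHP 5.3.6 or later, `DOMDocument::saveHTML()` accepts an optional node argument, so the temporary document is not strictly needed. A minimal sketch of that alternative follows; the helper name `innerHtmlOf` and the sample paragraph are made up for illustration.

```php
<?php
// Hypothetical helper: concatenates the serialized children of $element.
// Relies on DOMDocument::saveHTML($node), available from PHP 5.3.6.
function innerHtmlOf(DOMNode $element)
{
    $html = '';
    foreach ($element->childNodes as $child) {
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML('<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><br />3 years+</p>');

$para  = $dom->getElementsByTagName('p')->item(0);
$inner = innerHtmlOf($para);
echo $inner, "\n";
```

The result keeps the `<span>` and `<br>` markup but leaves out the surrounding `<p>` tag, which is what a description field would normally want.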