I'm doing some screen scraping with PHP and Simple HTML DOM. I'm struggling to find any consistency in the target markup, and the divs all have odd names. See the example:
<!-- Page START -->
<h2>Small houses</h2>
<p id="imPathTitle">Dolls Houses</p>
<div id="imPage">
<div id="imCel1_02">
<div id="imCel1_02_Cont">
<div id="imObj1_02">
<img src="images/daisylane.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel1_00">
<div id="imCel1_00_Cont">
<div id="imObj1_00">
<img src="images/1_h117.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel0_00">
<div id="imCel0_00_Cont">
<div id="imObj0_00">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door,<br />decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /><br /></span><span class="ff2 fc4 fs10 fb ">W440mm D350mm H425mm</span><span class="ff2 fc2 fs10 ">
<br /></span></p>
</div>
</div>
</div>
<div id="imCel1_01">
<div id="imCel1_01_Cont">
<div id="imObj1_01">
<img src="images/2_h111.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel0_01">
<div id="imCel0_01_Cont">
<div id="imObj0_01">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H111: Lilys Cottage</span><span class="ff3 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door, decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /></span><span class="ff2 fc4 fs10 fb ">
<br />W440mm D350mm H425mm</span><span class="ff2 fc4 fs10 ">
<br /></span></p>
</div>
</div>
</div>
<div id="imCel0_02">
<div id="imCel0_02_Cont">
<div id="imObj0_02">
<p class="imAlign_left"><span class="ff2 fc3 fs10 "> le toy van, wodden toys, designed in uk, pirate, fantasy, everyday, historical, fairytale, dolls, manufactured in indonesia, traditional wooden toys, fabric clothing,<br />designed in the uk manufactured in indonesia, copyright le toy van ltd, manufacturer distributor, designer, dolls houses, castles, garages, cars, budkins, traditional wooden toys, fairies, farms<br /></span></p>
</div>
</div>
</div>
</div>
<!-- Page END -->
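There are two products on this page, and they seem to be using divs as if they were table cells. Which elements can I target to get the image, title, and description of each product? This is what I'm using at the moment: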
foreach ($all_pages->find('img') as $src) {
    if (strpos($src->src, "http://letoyvan.com") === false) {
        $src->src = "http://letoyvan.com/$src->src";
    }
    $product['image'][] = $src->src;
}
foreach ($all_pages->find('p[class*=imAlign_left]') as $description) {
    $product['description'][] = $description->innertext;
}
foreach ($all_pages->find('span[class*=fc3]') as $title) {
    $product['title'][] = $title->innertext;
}
Answer 0 (score: 2)
Simple HTML DOM is very memory-hungry; DOMDocument is a much better choice. Here is an example:
$page = <<< HTML
<html>
<head>
<title>Test DOMDocument</title>
</head>
<body>
<!-- Page START -->
<h2>Small houses</h2>
<p id="imPathTitle">Dolls Houses</p>
<div id="imPage">
<div id="imCel1_02">
<div id="imCel1_02_Cont">
<div id="imObj1_02">
<img src="images/daisylane.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel1_00">
<div id="imCel1_00_Cont">
<div id="imObj1_00">
<img src="images/1_h117.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel0_00">
<div id="imCel0_00_Cont">
<div id="imObj0_00">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H117: Daisy Cottage</span><span class="ff2 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door,<br />decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /><br /></span><span class="ff2 fc4 fs10 fb ">W440mm D350mm H425mm</span><span class="ff2 fc2 fs10 ">
<br /></span></p>
</div>
</div>
</div>
<div id="imCel1_01">
<div id="imCel1_01_Cont">
<div id="imObj1_01">
<img src="images/2_h111.jpg" alt="" title="" />
</div>
</div>
</div>
<div id="imCel0_01">
<div id="imCel0_01_Cont">
<div id="imObj0_01">
<p class="imAlign_left"><span class="ff2 fc3 fs12 fb ">H111: Lilys Cottage</span><span class="ff3 fc2 fs10 ">
<br /></span><span class="ff2 fc4 fs10 ">Pretty painted cottage with daisy motif,<br />opening windows, shutters and door, decorated interior,<br />includes 'Starter furniture set'.<br />Dolls sold separately,<br />3 years+<br /></span><span class="ff2 fc4 fs10 fb ">
<br />W440mm D350mm H425mm</span><span class="ff2 fc4 fs10 ">
<br /></span></p>
</div>
</div>
</div>
<div id="imCel0_02">
<div id="imCel0_02_Cont">
<div id="imObj0_02">
<p class="imAlign_left"><span class="ff2 fc3 fs10 "> le toy van, wodden toys, designed in uk, pirate, fantasy, everyday, historical, fairytale, dolls, manufactured in indonesia, traditional wooden toys, fabric clothing,<br />designed in the uk manufactured in indonesia, copyright le toy van ltd, manufacturer distributor, designer, dolls houses, castles, garages, cars, budkins, traditional wooden toys, fairies, farms<br /></span></p>
</div>
</div>
</div>
</div>
<!-- Page END -->
</body>
</html>
HTML;
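libxml_use_internal_errors(true); // the markup is not strictly valid, so collect parser warnings quietly
$dom = new DOMDocument();
$dom->loadHTML($page); // loadHTML() parses a string; load() expects a file path
$product = array('image' => array(), 'description' => array(), 'title' => array());
foreach ($dom->getElementsByTagName('img') as $img)
{
    $src = $img->getAttribute('src');
    // Prefix relative paths with the site root
    if (strpos($src, "http://letoyvan.com") === false) {
        $src = "http://letoyvan.com/" . $src;
    }
    $product['image'][] = $src;
}
foreach ($dom->getElementsByTagName('p') as $para)
{
    // getAttribute() returns an empty string when the attribute is missing
    if ($para->getAttribute('class') == "imAlign_left")
    {
        $product['description'][] = $para->nodeValue;
    }
}
foreach ($dom->getElementsByTagName('span') as $span)
{
    // The class attribute holds several names ("ff2 fc3 fs12 fb "), so test for the fc3 token
    $classes = preg_split('/\s+/', trim($span->getAttribute('class')));
    if (in_array('fc3', $classes))
    {
        $product['title'][] = $span->nodeValue;
    }
}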
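The loops above collect images, descriptions, and titles into three separate lists, so nothing ties an image to its text. In the sample markup the image divs are named imObj1_NN and the text divs imObj0_NN, so one way to line them up is to match on the shared NN suffix. The following is only a sketch under that assumption: the function name pairProducts and its $baseUrl parameter are hypothetical, and the rule "a title is the span carrying both fc3 and fs12" is a guess made from this one page (it happens to skip the keyword block in imObj0_02).

```php
<?php
// Hypothetical helper: pair each text block with its image via the shared id suffix.
function pairProducts(string $html, string $baseUrl): array
{
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $products = [];
    // Text blocks live in <div id="imObj0_NN">.
    foreach ($xpath->query('//div[starts-with(@id, "imObj0_")]') as $textDiv) {
        $suffix = substr($textDiv->getAttribute('id'), strlen('imObj0_'));

        // Assumption: a product title is the span with both the fc3 and fs12 classes.
        $title = $xpath->query(
            './/span[contains(concat(" ", normalize-space(@class), " "), " fc3 ")'
            . ' and contains(concat(" ", normalize-space(@class), " "), " fs12 ")]',
            $textDiv
        )->item(0);
        if ($title === null) {
            continue; // no title span, so not a product block
        }

        // Assumption: the matching image sits in <div id="imObj1_NN">.
        $img = $xpath->query('//div[@id="imObj1_' . $suffix . '"]//img')->item(0);
        $src = $img !== null ? $img->getAttribute('src') : null;
        if ($src !== null && strpos($src, 'http') !== 0) {
            $src = rtrim($baseUrl, '/') . '/' . ltrim($src, '/');
        }

        $products[] = [
            'title'       => trim($title->textContent),
            'image'       => $src,
            'description' => trim($textDiv->textContent),
        ];
    }
    return $products;
}
```

Calling pairProducts($page, 'http://letoyvan.com') would then return one array per product, each with its own title, image, and description. If other pages on the site use different class names for titles, the XPath filter would need adjusting.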
If you need the description to keep its HTML, you can use this function:
function DOMinnerHTML($element)
{
    $innerHTML = "";
    foreach ($element->childNodes as $child)
    {
        // Copy each child (not the element itself) into a scratch document and serialise it
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        // Append, so that earlier children are not overwritten
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}
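On PHP 5.3.6 or later, DOMDocument::saveHTML() accepts a node argument, so the same result can probably be had without a scratch document. This is a minimal sketch; the function name innerHtml is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: serialise the children of a node using its owner document.
// Relies on DOMDocument::saveHTML($node), available since PHP 5.3.6.
function innerHtml(DOMNode $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($child) returns the markup for that child only
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return trim($html);
}
```

Inside the paragraph loop, $product['description'][] = innerHtml($para); would then store the spans and line breaks rather than plain text.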