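I need to extract some values, along with some raw HTML, from an HTML document. I thought about using XPath, but I can't get my query to work.
Here is what I am working with and trying to achieve: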
<div class="unit-id">
    <div class="title">
        some title-1
    </div>
    <div class="another-class">
        another class
    </div>
    <p>segwegw1<p>
    <p>segwegw1<p>
    <p>segwegw1<p>
    <p>segwegw1<p>
    <ul>
        <li>jfjfj</li>
        <li>jfjfj</li>
        <li>jfjfj</li>
    </ul>
</div>
<div class="unit-id">
    <div class="title">
        some title-2
    </div>
    <div class="another-class">
        some other class
    </div>
    <p>segwegw2<p>
    <p>segwegw2<p>
    <p>segwegw2<p>
    <p>segwegw2<p>
</div>
<div class="unit-id">
    <div class="title">
        some title-3
    </div>
    <div class="some-other-class">
        some other data
    </div>
    <p>segwegw3<p>
    <p>segwegw3<p>
    <p>segwegw3<p>
    <p>segwegw3<p>
</div>
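So I would like the query to iterate over each div with the unit-id class and return the value of its title div, plus the rest of that div's HTML with the nested divs left out (only the p tags and the ul content of that particular unit-id div), and then move on to the next iteration.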
Is this possible? Could you give me an example of how to write this query? Is there a better way to do it?
Answer 0 (score: 3)
This code does something close to what you are looking for:
function get_content($data){
    $doc = new DOMDocument();
    //load HTML string into document object
    //(@ suppresses the warnings libxml raises for malformed markup)
    if ( ! @$doc->loadHTML($data)){
        return FALSE;
    }
    //create XPath object using the document object as the parameter
    $xpath = new DOMXPath($doc);
    $query = "//div[@class='unit-id']";
    //XPath queries return a NodeList
    $res = $xpath->query($query);
    $out = array();
    foreach ($res as $key => $node){
        //subquery, evaluated relative to the current unit-id div
        $sub = $xpath->query('.//div[@class="title"]', $node);
        //guard against a unit-id div that has no title div
        $out[$key]['title'] = $sub->length ? trim($sub->item(0)->nodeValue) : '';
        //text of every <p> inside this unit
        foreach ($node->getElementsByTagName('p') as $key2 => $value){
            $out[$key]['par'][$key2] = $value->nodeValue;
        }
        //text of every <li> inside this unit
        foreach ($node->getElementsByTagName('li') as $key2 => $value){
            $out[$key]['list'][$key2] = $value->nodeValue;
        }
    }
    return $out;
}
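The question also asks for the remaining raw HTML rather than only the text values. A minimal sketch of that variation follows; the helper name get_units_html and the $sample string are hypothetical and not part of the code above, and the sample uses properly closed </p> tags. It relies on DOMDocument::saveHTML() accepting a node argument and on libxml_use_internal_errors() as an alternative to the @ operator.

```php
<?php
// Hypothetical helper (an illustration, not the answer's original code):
// returns each unit's title plus the raw HTML of its direct <p> and <ul> children.
function get_units_html(string $html): array
{
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    $units = [];
    foreach ($xpath->query("//div[@class='unit-id']") as $unit) {
        // Title text, evaluated relative to the current unit-id div.
        $title = trim($xpath->evaluate("string(./div[@class='title'])", $unit));

        // Direct <p> and <ul> children only, so the nested divs are skipped.
        $body = '';
        foreach ($xpath->query('./p | ./ul', $unit) as $child) {
            $body .= $doc->saveHTML($child);
        }
        $units[] = ['title' => $title, 'html' => $body];
    }
    return $units;
}

// Assumed sample input with well-formed paragraph tags.
$sample = '<div class="unit-id"><div class="title"> some title-1 </div>'
        . '<div class="another-class">another class</div>'
        . '<p>first</p><ul><li>item</li></ul></div>';
$units = get_units_html($sample);
echo $units[0]['title'], "\n";
echo $units[0]['html'], "\n";
```

The union expression ./p | ./ul returns the matching children in document order, so the paragraphs and the list keep their original sequence in the html string.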
Note that your HTML contains an error: the closing paragraph tags are missing their slash and should be written as </p>.
This is the output of get_content():
array
  0 =>
    array
      'title' => string 'some title-1' (length=12)
      'par' =>
        array
          0 => string 'segwegw1' (length=8)
          1 => string 'segwegw1' (length=8)
          2 => string 'segwegw1' (length=8)
          3 => string 'segwegw1' (length=8)
      'list' =>
        array
          0 => string 'jfjfj' (length=5)
          1 => string 'jfjfj' (length=5)
          2 => string 'jfjfj' (length=5)
  1 =>
    array
      'title' => string 'some title-2' (length=12)
      'par' =>
        array
          0 => string 'segwegw2' (length=8)
          1 => string 'segwegw2' (length=8)
          2 => string 'segwegw2' (length=8)
          3 => string 'segwegw2' (length=8)