I want to extract all p elements from an HTML string with simple_html_dom. The order in which the p elements appear should be preserved.
<section class="box_1">
<header class="trigger"><h2>Title</h2></header>
<div class="content">
<div class="box_2">
<div class="class"></div>
<div class="content">
<p>Text Level 2</p>
<p>More Text Level 2</p>
</div>
</div>
<div class="box_2">
<div class="class"></div>
<div class="content">
<p>Text Level 2</p>
<div class="box_3">
<div class="content">
<p>Text Level 3</p>
</div>
</div>
</div>
</div>
</div>
</section>
However, all p elements inside the same content container should be merged into a single entry.
I tried:
foreach($html->find('p') as $element) {
    if ($element->parent()->parent()) {
        $class = $element->parent()->parent()->getAttribute('class');
        if ($class == "box_3") $level = 3;
        else if ($class == "box_2") $level = 2;
        else if ($class == "box_1") $level = 1;
    }
    else { $level = 0; }
    $array_content_element = array("level" => $level, "inhalt" => $element->plaintext);
    array_push($array_content, $array_content_element);
}
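But this way "Text Level 2" and "More Text Level 2" are handled as two elements. They should instead be merged into "Text Level 2\nMore Text Level 2" and handled as a single element.
So in this example the result should be an array with three elements (not four).
Update: I forgot something. There can be p elements outside the section elements. See the "Lorem ipsum" paragraphs below.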
<p>Lorem ipsum</p>
<p>Lorem ipsum</p>
<section class="box_1">
<header class="trigger"><h2>Title</h2></header>
<div class="content">
<div class="box_2">
<div class="class"></div>
<div class="content">
<p>Text Level 2</p>
<p>More Text Level 2</p>
</div>
</div>
<div class="box_2">
<div class="class"></div>
<div class="content">
<p>Text Level 2</p>
<div class="box_3">
<div class="content">
<p>Text Level 3</p>
</div>
</div>
</div>
</div>
</div>
</section>
<p>Lorem ipsum</p>
<p>Lorem ipsum</p>
<section class="box_1">
<header class="trigger"><h2>Title</h2></header>
<div class="content">
<p>Text Level 1</p>
</div>
</section>
<p>Lorem ipsum</p>
<p>Lorem ipsum</p>
These p elements should be handled like the others (the p elements of one block are combined into one entry). In this case, level = 0.
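Answer 0 (score: 2)
You first have to determine which kind each p tag is: an orphan (directly under the root) or one inside a content container. Then, once the end of a batch is reached (no further p tag follows as the next sibling), simply move on to the next key/batch. Consider this example: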
include 'simple_html_dom.php';
$html_string = '<p>Lorem ipsum</p><p>Lorem ipsum</p><section class="box_1"> <header class="trigger"><h2>Title</h2></header> <div class="content"> <div class="box_2"> <div class="class"></div> <div class="content"> <p>Text Level 2</p> <p>More Text Level 2</p> </div> </div> <div class="box_2"> <div class="class"></div> <div class="content"> <p>Text Level 2</p> <div class="box_3"> <div class="content"> <p>Text Level 3</p> </div> </div> </div> </div> </div></section><p>Lorem ipsum</p><p>Lorem ipsum</p><section class="box_1"> <header class="trigger"><h2>Title</h2></header> <div class="content"> <p>Text Level 1</p> </div></section><p>Lorem ipsum</p><p>Lorem ipsum</p>';
$html = str_get_html($html_string);
$array_content = array();
$index = 0;
foreach ($html->find('p') as $tag) {
    if ($tag->parent()->tag == 'root') {
        // orphan p tag directly under the document root
        if (!isset($array_content[$index])) {
            $array_content[$index] = array('level' => 0, 'inhalt' => $tag->innertext);
        } else {
            $array_content[$index]['inhalt'] .= "\n" . $tag->innertext;
        }
    } elseif ($tag->parent()->class == 'content') {
        // p tag inside a content container: the level comes from the box around it
        switch ($tag->parent()->parent()->class) {
            case 'box_1': $level = 1; break;
            case 'box_2': $level = 2; break;
            case 'box_3': $level = 3; break;
            default:      $level = 0; // unknown container class
        }
        if (!isset($array_content[$index])) {
            $array_content[$index] = array('level' => $level, 'inhalt' => $tag->innertext);
        } else {
            $array_content[$index]['inhalt'] .= "\n" . $tag->innertext;
        }
    }
    // move on to the next batch when the next sibling is not a p tag
    $next = $tag->next_sibling();
    if ($next === null || $next->tag != 'p') {
        $index++;
    }
}
echo '<pre>';
print_r($array_content);
echo '</pre>';
This should output:
Array
(
[0] => Array
(
[level] => 0
[inhalt] => Lorem ipsum
Lorem ipsum
)
[1] => Array
(
[level] => 2
[inhalt] => Text Level 2
More Text Level 2
)
[2] => Array
(
[level] => 2
[inhalt] => Text Level 2
)
[3] => Array
(
[level] => 3
[inhalt] => Text Level 3
)
[4] => Array
(
[level] => 0
[inhalt] => Lorem ipsum
Lorem ipsum
)
[5] => Array
(
[level] => 1
[inhalt] => Text Level 1
)
[6] => Array
(
[level] => 0
[inhalt] => Lorem ipsum
Lorem ipsum
)
)
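If the p tags are not always exactly two levels below a box_N element, the same batching idea can be combined with a lookup of the nearest box_N ancestor. The following is only a minimal sketch under a few assumptions: it uses PHP's built-in DOM extension (DOMDocument) instead of simple_html_dom, it collects the plain text (textContent) rather than innertext, and the function names level_of, next_element, and group_paragraphs are hypothetical helpers made up for this illustration.

```php
<?php
// Sketch only: built-in DOM extension instead of simple_html_dom.
// level_of, next_element and group_paragraphs are hypothetical helper names.

function level_of(DOMElement $p): int
{
    // Walk up the ancestors and use the nearest class of the form box_N.
    for ($node = $p->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
        if (preg_match('/\bbox_(\d+)\b/', $node->getAttribute('class'), $m)) {
            return (int) $m[1];
        }
    }
    return 0; // no box_N ancestor: treat as a top-level paragraph
}

function next_element(DOMNode $node): ?DOMNode
{
    // Skip whitespace text nodes and comments between elements.
    for ($n = $node->nextSibling; $n !== null; $n = $n->nextSibling) {
        if ($n->nodeType === XML_ELEMENT_NODE) {
            return $n;
        }
    }
    return null;
}

function group_paragraphs(string $html): array
{
    $doc = new DOMDocument();
    // The libxml HTML parser warns about tags such as <section>; collect
    // those warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $groups = array();
    $index = 0;
    foreach ($doc->getElementsByTagName('p') as $p) {
        $text = trim($p->textContent);
        if (!isset($groups[$index])) {
            $groups[$index] = array('level' => level_of($p), 'inhalt' => $text);
        } else {
            $groups[$index]['inhalt'] .= "\n" . $text;
        }
        // Start a new batch when the next element sibling is not a p tag.
        $next = next_element($p);
        if ($next === null || $next->nodeName !== 'p') {
            $index++;
        }
    }
    return $groups;
}

$sample = '<p>A</p><p>B</p><section class="box_1"><div class="content"><p>C</p></div></section>';
$result = group_paragraphs($sample);
print_r($result);
```

For the short sample string, the two top-level paragraphs end up in one entry with level 0 and the paragraph inside the section gets level 1. Taking the level from the nearest box_N ancestor is a design choice of this sketch, not something the question requires; for the markup in the question it gives the same levels as checking parent()->parent().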