Parse HTML with PHP and get all h3's that follow an h2, up to the next h2

Time: 2013-08-09 21:48:06

Tags: php parsing dom html-parsing domdocument

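I want to find the first h2 in an article. Once it is found, collect every h3 until the next h2 appears. Rinse and repeat until all headings and subheadings have been found.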

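Before you flag or close this as a duplicate parsing question, please note the title: this is not about basic node retrieval. I already have that part working.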

I am parsing the HTML with DOMDocument, using DOMDocument::loadHTML(), DOMDocument::getElementsByTagName(), and DOMDocument::saveHTML(), to retrieve the important headings of an article.

My code looks like this:

$matches = array();
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach($dom->getElementsByTagName('h2') as $node) {
    $matches['heading-two'][] = $dom->saveHtml($node);
}
foreach($dom->getElementsByTagName('h3') as $node) {
    $matches['heading-three'][] = $dom->saveHtml($node);
}
if($matches){
    $this->key_points = $matches;
}

This gives me output like:

array(
    'heading-two' => array(
        '<h2>Here is the first heading two</h2>',
        '<h2>Here is the SECOND heading two</h2>'
    ),
    'heading-three' => array(
        '<h3>Here is the first h3</h3>',
        '<h3>Here is the second h3</h3>',
        '<h3>Here is the third h3</h3>',
        '<h3>Here is the fourth h3</h3>',
    )
);

I would prefer something more like:

array(
    '<h2>Here is the first heading two</h2>' => array(
        '<h3>Here is an h3 under the first h2</h3>',
        '<h3>Here is another h3 found under first h2, but after the first h3</h3>'
    ),
    '<h2>Here is the SECOND heading two</h2>' => array(
        '<h3>Here is an h3 under the SECOND h2</h3>',
        '<h3>Here is another h3 found under SECOND h2, but after the first h3</h3>'
    )
);

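I am not necessarily looking for complete code (though if you think that would help others more, go ahead). Mostly I want guidance or a pointer in the right direction for building a nested array like the one above.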

2 answers:

Answer 0: (score: 6)

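I assume all headings sit at the same level in the DOM, so every h3 is a sibling of its h2. Under that assumption, you can iterate over the siblings of each h2 until you hit the next h2: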

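foreach($dom->getElementsByTagName('h2') as $node) {
    $key = $dom->saveHTML($node);
    $matches[$key] = array();
    // Walk forward through the siblings until the next h2 (or the end of the parent).
    $sibling = $node->nextSibling;
    while($sibling !== null && $sibling->nodeName !== 'h2') {
        if($sibling->nodeName === 'h3') {
            $matches[$key][] = $dom->saveHTML($sibling);
        }
        $sibling = $sibling->nextSibling;
    }
}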
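
If the headings are not all siblings (say, an h3 wrapped in its own div), the sibling walk will skip them. One possible alternative, which is not part of this answer, is to select every h2 and h3 with DOMXPath and group them in a single pass. This is only a sketch: the sample markup in $content is made up, and it relies on the query returning nodes in document order, which is how PHP's libxml-backed DOMXPath behaves in practice.

```php
<?php
// Hypothetical sample input; the h3 elements sit at different depths on purpose.
$content = '<div><h2>First</h2><div><h3>A</h3></div><h3>B</h3></div>'
         . '<div><h2>Second</h2><h3>C</h3></div>';

$dom = new DOMDocument;
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);

$matches = array();
$current = null; // markup of the most recent h2, used as the array key

// The union selects both tags; nodes are visited in document order.
foreach ($xpath->query('//h2 | //h3') as $node) {
    $html = $dom->saveHTML($node);
    if ($node->nodeName === 'h2') {
        $current = $html;
        $matches[$current] = array();
    } elseif ($current !== null) {
        // An h3 that appears before any h2 is ignored.
        $matches[$current][] = $html;
    }
}

print_r($matches);
```

As with the loop above, two h2 elements with identical markup would share one key, so their h3 lists would be merged.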

Answer 1: (score: 1)

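This can also be done by taking the line number at which each node was found in the document and using it as the array key. ksort($matches) then puts every node back into the order in which it appears in the HTML document.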

$matches = array();
$dom = new DOMDocument;
$dom->loadHTML($content);

foreach($dom->getElementsByTagName('h2') as $node) {
    $matches[$node->getLineNo()] = $dom->saveHtml($node);
}
foreach($dom->getElementsByTagName('h3') as $node) {
    $matches[$node->getLineNo()] = $dom->saveHtml($node);
}

ksort($matches);

...or, more compactly:

foreach(array('h2', 'h3') as $tag) {
    foreach($dom->getElementsByTagName($tag) as $node) {
        $matches[$node->getLineNo()] = $dom->saveHtml($node);
    }
}

ksort($matches);
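
The result of the line-number approach is a flat list in document order, so it needs one more pass to become the nested array the question asks for. The sketch below shows one way that pass might look. groupHeadings() is a hypothetical helper name, and the sample array merely stands in for a ksort()ed $matches. Keep in mind that two headings on the same source line would share a key, so one would overwrite the other before this step.

```php
<?php
// Hypothetical helper: nest a flat, document-ordered list of heading markup.
function groupHeadings(array $ordered)
{
    $nested = array();
    $current = null; // markup of the most recent h2
    foreach ($ordered as $html) {
        if (stripos($html, '<h2') === 0) {
            $current = $html;
            $nested[$current] = array();
        } elseif ($current !== null && stripos($html, '<h3') === 0) {
            // Attach the h3 to the h2 that precedes it.
            $nested[$current][] = $html;
        }
    }
    return $nested;
}

// Stand-in for the sorted $matches; the keys play the role of line numbers.
$matches = array(
    3  => '<h2>First</h2>',
    5  => '<h3>A</h3>',
    9  => '<h3>B</h3>',
    12 => '<h2>Second</h2>',
    14 => '<h3>C</h3>',
);

$nested = groupHeadings($matches);
print_r($nested);
```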