Question

I am trying to parse a single HTML paragraph into an array of its building blocks. I have this HTML paragraph:

$element_content = '<p>Start of paragraph - <strong><em>This note</em></strong> provides <em>information</em> about the contractual terms.</p>';

What I have done so far is:

$dom = new DOMDocument();
$dom->loadXML($element_content);

foreach ($dom->getElementsByTagName('*') as $node) {

    echo $node->getNodePath().'<br>';
    echo $node->nodeValue.'<br>';
}

Which gives me this result:

/p
Start of paragraph - This note provides information about the contractual terms.
/p/strong
This note
/p/strong/em
This note
/p/em
information

But I would like to get this instead:

/p
Start of paragraph - 
/p/strong/em
This note
/p
 provides 
/p/em
information
/p
 about the contractual terms.

Any ideas on how to achieve this?

Answer 1

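Everything in a DOM is a node: not just the elements, but the text as well. You are fetching the element nodes, but your desired result outputs the text nodes individually. So you need to fetch the DOM text nodes that are not whitespace-only nodes. That is not difficult with an XPath expression: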

//text()[normalize-space(.) != ""]

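//text() fetches any text node in the document (this includes CDATA sections). normalize-space() is an XPath function that reduces each run of whitespace inside a string to a single space and strips leading and trailing whitespace. So [normalize-space(.) != ""] removes every node that contains only whitespace from the list.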
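To see what the predicate filters out, here is a minimal sketch using a made-up sample document (the $xml string is an assumption for illustration, not the markup from the question). Its indentation creates whitespace-only text nodes, which DOMDocument keeps by default:

```php
<?php
// Hypothetical sample: the line breaks and indentation become text nodes of their own.
$xml = "<p>\n  <em>one</em>\n  <strong>  two   words </strong>\n</p>";

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Every text node, including the three that hold only whitespace.
$all = $xpath->evaluate('//text()');

// Only the text nodes that are non-empty after normalization.
$nonEmpty = $xpath->evaluate('//text()[normalize-space(.) != ""]');

echo $all->length, "\n";       // 5
echo $nonEmpty->length, "\n";  // 2

// normalize-space() can also be evaluated on its own; the result is a string.
$normalized = $xpath->evaluate('normalize-space(/p/strong)');
echo '[', $normalized, ']', "\n"; // [two words]
```

Note that the predicate only decides which nodes are selected. The selected nodes keep their original text, including leading and trailing spaces.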

The parent node of each text node is the element that contains it. Putting it together:

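// $content holds the paragraph markup ($element_content in the question)
$document = new DOMDocument();
$document->loadXML($content);
$xpath = new DOMXPath($document);

// select every text node that is not whitespace-only
$nodes = $xpath->evaluate('//text()[normalize-space(.) != ""]');

foreach ($nodes as $node) {
    // the path of the containing element, then the text itself
    echo $node->parentNode->getNodePath(), "\n";
    echo $node->textContent, "\n";
}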

Output:

/p
Start of paragraph - 
/p/strong/em
This note
/p
 provides 
/p/em
information
/p
 about the contractual terms.
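
Since the question asks for an array rather than printed output, the same loop can collect its results instead of echoing them. The following is only a sketch: the function name splitIntoBlocks and the 'path' / 'text' keys are hypothetical choices, not part of the original answer.

```php
<?php
/**
 * Hypothetical helper: returns one ['path' => ..., 'text' => ...] entry
 * per non-whitespace text node, in document order.
 */
function splitIntoBlocks(string $markup): array
{
    $document = new DOMDocument();
    // Assumes well-formed markup with a single root element, as in the question.
    $document->loadXML($markup);
    $xpath = new DOMXPath($document);

    $blocks = [];
    foreach ($xpath->evaluate('//text()[normalize-space(.) != ""]') as $node) {
        $blocks[] = [
            'path' => $node->parentNode->getNodePath(), // element containing the text
            'text' => $node->textContent,               // text, whitespace preserved
        ];
    }
    return $blocks;
}

$element_content = '<p>Start of paragraph - <strong><em>This note</em></strong> provides <em>information</em> about the contractual terms.</p>';
$blocks = splitIntoBlocks($element_content);
print_r($blocks);
```

This relies on loadXML(), so it only works for well-formed input. For markup that is not well-formed, loadHTML() would be the likely alternative, but by default it wraps the fragment in html and body elements, so the reported paths would start with /html/body/p instead of /p.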
