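I'm using DOMDocument to parse some HTML and want to replace line breaks (<br>) with \n. However, I'm having trouble identifying where a break actually occurs within the document.
Given the following HTML snippet, taken from a larger file that I am reading with $dom->loadHTMLFile($pFilename):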
<p>Multiple-line paragraph<br />that has a close tag</p>
and my code:
foreach ($dom->getElementsByTagName('*') as $domElement) {
    switch (strtolower($domElement->nodeName)) {
        case 'p':
            $str = (string) $domElement->nodeValue;
            echo 'PARAGRAPH: ', $str, PHP_EOL;
            break;
        case 'br':
            echo 'BREAK: ', PHP_EOL;
            break;
    }
}
I get:
PARAGRAPH: Multiple-line paragraphthat has a close tag
BREAK:
How can I identify the position of that break within the paragraph and replace it with \n?
Or is there a better alternative than DOMDocument for parsing HTML?
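Answer 0: (score: 8)
You can't get the position of an element using getElementsByTagName, because it returns matching elements without the text that surrounds them. Instead, walk through the childNodes of each element and handle text nodes and elements separately.
In the general case you'll need recursion, like this: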
function processElement(DOMNode $element) {
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMText) {
            echo $child->nodeValue, PHP_EOL;
        } elseif ($child instanceof DOMElement) {
            switch ($child->nodeName) {
                case 'br':
                    echo 'BREAK: ', PHP_EOL;
                    break;
                case 'p':
                    echo 'PARAGRAPH: ', PHP_EOL;
                    processElement($child);
                    echo 'END OF PARAGRAPH;', PHP_EOL;
                    break;
                // etc.
                // other cases:
                default:
                    processElement($child);
            }
        }
    }
}
$D = new DOMDocument;
$D->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
processElement($D);
This will output:
PARAGRAPH:
Multiple-line paragraph
BREAK:
that has a close tag
END OF PARAGRAPH;
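If the goal is a single string per paragraph with \n where the break was, the same traversal can build a string instead of echoing. This is only a minimal sketch: textWithBreaks is a hypothetical helper name, not something from the answer above, and it assumes that every non-br element should simply contribute its text.

```php
<?php
// Hypothetical helper (not part of the answer above): collects the text of a
// node and emits "\n" wherever a <br> element occurs.
function textWithBreaks(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $out .= $child->nodeValue;
        } elseif ($child instanceof DOMElement) {
            // A <br> becomes a newline; any other element is descended into.
            $out .= strtolower($child->nodeName) === 'br'
                ? "\n"
                : textWithBreaks($child);
        }
    }
    return $out;
}

$doc = new DOMDocument;
$doc->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
foreach ($doc->getElementsByTagName('p') as $p) {
    echo 'PARAGRAPH: ', textWithBreaks($p), PHP_EOL;
}
```

Because the function returns a string rather than printing, the caller decides what to do with each paragraph.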
Answer 1: (score: 2)
Since you don't have to deal with child nodes and the like, why not just replace the br directly?
$str = '<p>Multiple-line paragraph<br />that has<br>a close tag</p>';
echo preg_replace('/<br\s*\/?>/', "\n", $str);
Output:
<p>Multiple-line paragraph
that has
a close tag</p>
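Note that the pattern above is case-sensitive and only allows whitespace before the optional slash, so it would miss <BR> or a br carrying attributes. A slightly more tolerant variant might look like the sketch below. The sample string is made up for illustration, and the pattern assumes that no attribute value contains a literal ">".

```php
<?php
// Assumption: the input may contain upper-case tags or attributes on <br>.
$str = '<p>Multiple-line<BR>paragraph<br />that<br class="x">has a close tag</p>';

// \b keeps the pattern from matching other tags that merely start with "br";
// [^>]* skips optional whitespace, attributes and the trailing slash;
// the i flag makes the match case-insensitive.
$result = preg_replace('/<br\b[^>]*>/i', "\n", $str);

echo $result;
```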
Alternative (using DOM):
$str = '<p>Multiple-line<BR>paragraph<br />that<BR/>has<br>a close<Br>tag</p>';
$dom = new DOMDocument();
$dom->loadHTML($str);
// using xpath here, because it will find every br-tag regardless
// of it being self-closing or not
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//br') as $br) {
    $br->parentNode->replaceChild($dom->createTextNode("\n"), $br);
}
// output whole html
echo $dom->saveHTML();
// or just the body child-nodes
$output = '';
foreach ($xpath->query('//body/*') as $bodyChild) {
    $output .= $dom->saveXML($bodyChild);
}
echo $output;
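Once the br elements have been swapped for text nodes, the paragraph's text already contains the newline, so it can be read directly instead of serialising markup. A minimal sketch, using the snippet from the question as input:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');

// Replace every <br> with a text node holding a newline.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//br') as $br) {
    $br->parentNode->replaceChild($dom->createTextNode("\n"), $br);
}

// textContent concatenates all descendant text nodes, including the new "\n".
foreach ($dom->getElementsByTagName('p') as $p) {
    echo 'PARAGRAPH: ', $p->textContent, PHP_EOL;
}
```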
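Answer 2: (score: 1)
I wrote a simple class that doesn't use recursion, so it should be faster and use less memory, but it is basically the same concept as @Hrant Khachatrian's answer (walk through all elements and look for child tags):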
class DomScParser {
    public static function find(DOMNode $parent_node, $tag_name) {
        //Check if we already got a self-contained node
        if (!$parent_node->childNodes->length) {
            if ($parent_node->nodeName == $tag_name) {
                //Return an array so that replace() can always iterate over the result
                return array($parent_node);
            }
        }
        //Initialize path array
        $dom_path = array($parent_node->firstChild);
        //Initialize found nodes array
        $found_dom_arr = array();
        //Iterate while we have elements in path
        while ($dom_path_size = count($dom_path)) {
            //Get last element in path
            $current_node = end($dom_path);
            //If it is an empty element - nothing to do here,
            //we should step back in our path.
            if (!$current_node) {
                array_pop($dom_path);
                continue;
            }
            if ($current_node->firstChild) {
                //If node has children - add its first child to the end of the path.
                //As we are looking for self-contained nodes without children,
                //this node is not what we are looking for - change the corresponding
                //path element to its sibling.
                $dom_path[] = $current_node->firstChild;
                $dom_path[$dom_path_size - 1] = $current_node->nextSibling;
            } else {
                //Check if we found the correct node; either way, change the
                //corresponding path element to its sibling.
                if ($current_node->nodeName == $tag_name) {
                    $found_dom_arr[] = $current_node;
                }
                $dom_path[$dom_path_size - 1] = $current_node->nextSibling;
            }
        }
        return $found_dom_arr;
    }
    public static function replace(DOMNode $parent_node, $search_tag_name, $replace_tag) {
        //Check if we got a node to replace the found node with, or just some text.
        if (!$replace_tag instanceof DOMNode) {
            //Get DOMDocument object
            if ($parent_node instanceof DOMDocument) {
                $dom = $parent_node;
            } else {
                $dom = $parent_node->ownerDocument;
            }
            $replace_tag = $dom->createTextNode($replace_tag);
        }
        $found_tags = self::find($parent_node, $search_tag_name);
        foreach ($found_tags as $found_tag) {
            //Each found node gets its own (shallow) copy of the replacement
            $found_tag->parentNode->replaceChild($replace_tag->cloneNode(), $found_tag);
        }
    }
}
$D = new DOMDocument;
$D->loadHTML('<span>test1<br />test2</span>');
DomScParser::replace($D, 'br', "\n");
P.S. Also, it won't break on deeply nested tags, because it doesn't use recursion. Example HTML:
$html=str_repeat('<b>',100).'<br />'.str_repeat('</b>',100);
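For comparison, the same non-recursive idea can be written more compactly with an explicit stack. This is a rough sketch rather than a drop-in replacement for the class above: findByTagIterative is a hypothetical name, and unlike DomScParser::find it also matches elements that have children.

```php
<?php
// Hypothetical helper: iterative depth-first search using an explicit stack,
// so nesting depth costs array entries rather than call-stack frames.
function findByTagIterative(DOMNode $root, string $tagName): array
{
    $found = [];
    $stack = [$root];
    while ($stack) {
        $node = array_pop($stack);
        if ($node instanceof DOMElement && strtolower($node->nodeName) === $tagName) {
            $found[] = $node;
        }
        // Push children in reverse so they are popped in document order.
        for ($child = $node->lastChild; $child !== null; $child = $child->previousSibling) {
            $stack[] = $child;
        }
    }
    return $found;
}

// The deeply nested example from the P.S. above.
$html = str_repeat('<b>', 100) . '<br />' . str_repeat('</b>', 100);
$doc = new DOMDocument;
$doc->loadHTML($html);
echo count(findByTagIterative($doc, 'br')), PHP_EOL;
```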