Why does XPath change my HTML?

Date: 2017-01-29 15:41:45

Tags: php xpath

Here is my code (Online Demo):

$html_string = <<<STR
<p>paragraph<a>link</a></p>
<div class="myclass">
    <div>something</div>
    <div style="mystyle">something</div>
    <b><a href="#">link</a></b>
    <b><a href="#" name="a name">link</a></b>
    <b style="color:red">bold</b>
    <img src="../path" alt="something" />
    <img src="../path" alt="something" class="myclass" />
</div>
STR;

$dom = new DOMDocument;
$dom->loadHTML(mb_convert_encoding($html_string, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
    if ($node->nodeName != "src" && $node->nodeName != "href" && $node->nodeName != "alt") {
        $node->parentNode->removeAttribute($node->nodeName);
    }
}

echo $dom->saveHTML(); 

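As you can see in the demo, the closing </p> ends up in the wrong place in the output: it has been moved to the end. Why does this happen, and how can I fix it?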

1 Answer:

Answer 0 (score: 1)

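Every DOMDocument needs a single root node. For an HTML document, that is normally the <html> node.

Because a root node is required, and LIBXML_HTML_NOIMPLIED stops libxml from adding the implied <html> and <body> elements, libxml takes the first node in your markup, the p element, as the root node.
That is why the next node, div[@class="myclass"], becomes a child of the p element in the output of $dom->saveHTML(), and why the closing </p> appears after the div.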
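The re-parenting described above can be checked directly. This is a minimal sketch, assuming a shortened two-element fragment in place of the markup from the question; the variable names are illustrative only, and the result may differ between libxml versions.

```php
<?php
// Minimal reproduction: two sibling top-level elements and no single root.
// The fragment is a shortened stand-in for the markup in the question.
$html = '<p>paragraph</p><div class="myclass">something</div>';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Name of the element libxml chose as the document's root.
$root = $dom->documentElement->nodeName;

// Find the <div> and see which element it was attached to.
$div = $dom->getElementsByTagName('div')->item(0);
$parent = $div->parentNode->nodeName;

echo "root: $root, parent of div: $parent\n";
```

If the explanation holds, both names should be the p element, which shows that the div was nested while parsing, before any XPath query runs.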

To solve the problem, wrap your markup in a single root node such as <html>:

$html_string = <<<STR
<html>
<p>paragraph<a>link</a></p>
<div class="myclass">
    <div>something</div>
    <div style="mystyle">something</div>
    <b><a href="#">link</a></b>
    <b><a href="#" name="a name">link</a></b>
    <b style="color:red">bold</b>
    <img src="../path" alt="something" />
    <img src="../path" alt="something" class="myclass" />
</div>
</html>
STR;

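// NOIMPLIED/NODEFDTD stop libxml from adding <html>/<body> and a doctype,
// so the explicit <html> wrapper above serves as the single root node.
// Note: the 'HTML-ENTITIES' encoding for mb_convert_encoding is deprecated as of PHP 8.2.
$dom = new DOMDocument;
$dom->loadHTML(mb_convert_encoding($html_string, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Select every attribute node in the document.
$nodes = $xpath->query('//@*');
foreach ($nodes as $node) {
    // Keep only src, href and alt; remove every other attribute from its element.
    if ($node->nodeName != "src" && $node->nodeName != "href" && $node->nodeName != "alt") {
        $node->parentNode->removeAttribute($node->nodeName);
    }
}

// The output now includes the wrapping <html> element.
echo $dom->saveHTML();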
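If the wrapping element should not appear in the output, one possible variation is to serialize only the wrapper's children. The sketch below is not part of the original answer: the helper name cleanFragment, its $keep parameter, and the choice of a <div> wrapper are all assumptions made for illustration. It also omits the encoding conversion from the question, so it is only suitable as-is for ASCII input.

```php
<?php
// Hypothetical helper (not from the original answer): parse a fragment under
// a temporary wrapper, strip unwanted attributes, and return only the
// wrapper's children as HTML.
function cleanFragment(string $fragment, array $keep = ['src', 'href', 'alt']): string
{
    $dom = new DOMDocument;
    // Assumption: a <div> wrapper; any single root element would serve.
    $dom->loadHTML('<div>' . $fragment . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Remove every attribute whose name is not in the keep list.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//@*') as $attr) {
        if (!in_array($attr->nodeName, $keep, true)) {
            $attr->ownerElement->removeAttribute($attr->nodeName);
        }
    }

    // Serialize each child of the wrapper, leaving the wrapper itself out.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$sample = '<p>paragraph<a>link</a></p>'
        . '<div class="myclass"><b style="color:red"><a href="#">bold</a></b></div>';
$out = cleanFragment($sample);
echo $out, "\n";
```

Passing a node to saveHTML() serializes just that node and its subtree, which is what lets the wrapper be dropped from the result.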