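I'm using XPath to manipulate a short HTML snippet. When I output the modified snippet with $doc->saveHTML(), a DOCTYPE is added and HTML / BODY tags are wrapped around the output. I want to remove them but keep all the children inside, using only DOMDocument functions. For example: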
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<a href="http://www....."><img src="http://" alt=""></a>
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here
echo htmlentities( $doc->saveHTML() );
This produces:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><body>
<p><strong>Title...</strong></p>
<a href="http://www....."><img src="http://" alt=""></a>
<p>...to be one of those crowning achievements...</p>
</body></html>
I've tried a few simple tricks, such as:
# removes doctype
$doc->removeChild($doc->firstChild);
# <body> replaces <html>
$doc->replaceChild($doc->firstChild->firstChild, $doc->firstChild);
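So far that only removes the DOCTYPE and replaces HTML with BODY. What remains is body > a variable number of elements.
How do I remove the <body> tag but keep all of its children, given that their structure is variable, in a clean way using PHP's DOM manipulation?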
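Answer 0 (score: 15)
Here is a version that doesn't extend DOMDocument, although I think extending is the right approach, since you are trying to implement functionality that isn't built into the DOM API.
Note: I'm interpreting "clean" and "no workarounds" as keeping every operation within the DOM API. As soon as you resort to string manipulation, you're in workaround territory.
What I'm doing, just as in the original answer, is using a DOMDocumentFragment to manipulate several nodes that all sit at the root level. There is no string manipulation, so I don't consider it a workaround.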
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');
// Remove doctype node
$doc->doctype->parentNode->removeChild($doc->doctype);
// Remove html element, preserving child nodes
$html = $doc->getElementsByTagName("html")->item(0);
$fragment = $doc->createDocumentFragment();
while ($html->childNodes->length > 0) {
    $fragment->appendChild($html->childNodes->item(0));
}
$html->parentNode->replaceChild($fragment, $html);
// Remove body element, preserving child nodes
$body = $doc->getElementsByTagName("body")->item(0);
$fragment = $doc->createDocumentFragment();
while ($body->childNodes->length > 0) {
    $fragment->appendChild($body->childNodes->item(0));
}
$body->parentNode->replaceChild($fragment, $body);
// Output results
echo htmlentities($doc->saveHTML());
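The two loops above differ only in the node they empty, so they could be folded into one function. Here is a minimal sketch of that idea; `unwrapNode` is a hypothetical name, not part of the DOM API, and the scalar type hints assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not part of the DOM API):
// replaces $node with its own children, using only DOM calls.
function unwrapNode(DOMNode $node): void
{
    // Nothing to preserve: just drop the node.
    if (!$node->hasChildNodes()) {
        $node->parentNode->removeChild($node);
        return;
    }
    $fragment = $node->ownerDocument->createDocumentFragment();
    // Appending a child to the fragment detaches it from $node,
    // so firstChild advances until $node is empty.
    while ($node->firstChild !== null) {
        $fragment->appendChild($node->firstChild);
    }
    $node->parentNode->replaceChild($fragment, $node);
}

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p><p>...to be one of those crowning achievements...</p>');

// Remove the doctype, then unwrap <html> and <body> in turn.
$doc->removeChild($doc->doctype);
unwrapNode($doc->getElementsByTagName('html')->item(0));
unwrapNode($doc->getElementsByTagName('body')->item(0));

echo $doc->saveHTML();
```

The order matters here: unwrapping html first leaves body at the root level, which is the same sequence the longer version follows.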
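The original solution, below, is rather verbose, but that's because it works by extending the DOM classes so that the final calling code stays as short as possible.
sliceOutNode is where the magic happens. Let me know if you have any questions: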
<?php
class DOMDocumentExtended extends DOMDocument
{
public function __construct( $version = "1.0", $encoding = "UTF-8" )
{
parent::__construct( $version, $encoding );
$this->registerNodeClass( "DOMElement", "DOMElementExtended" );
}
// This method will need to be removed once PHP supports LIBXML_NOXMLDECL
public function saveXML( DOMNode $node = NULL, $options = 0 )
{
$xml = parent::saveXML( $node, $options );
if( $options & LIBXML_NOXMLDECL )
{
$xml = $this->stripXMLDeclaration( $xml );
}
return $xml;
}
public function stripXMLDeclaration( $xml )
{
return preg_replace( "|<\?xml(.+?)\?>[\n\r]?|i", "", $xml );
}
}
class DOMElementExtended extends DOMElement
{
public function sliceOutNode()
{
$nodeList = new DOMNodeListExtended( $this->childNodes );
$this->replaceNodeWithNode( $nodeList->toFragment( $this->ownerDocument ) );
}
public function replaceNodeWithNode( DOMNode $node )
{
return $this->parentNode->replaceChild( $node, $this );
}
}
class DOMNodeListExtended extends ArrayObject
{
public function __construct( $mixedNodeList )
{
parent::__construct( array() );
$this->setNodeList( $mixedNodeList );
}
private function setNodeList( $mixedNodeList )
{
if( $mixedNodeList instanceof DOMNodeList )
{
$this->exchangeArray( array() );
foreach( $mixedNodeList as $node )
{
$this->append( $node );
}
}
elseif( is_array( $mixedNodeList ) )
{
$this->exchangeArray( $mixedNodeList );
}
else
{
throw new DOMException( "DOMNodeListExtended only supports a DOMNodeList or array as its constructor parameter." );
}
}
public function toFragment( DOMDocument $contextDocument )
{
$fragment = $contextDocument->createDocumentFragment();
foreach( $this as $node )
{
$fragment->appendChild( $contextDocument->importNode( $node, true ) );
}
return $fragment;
}
// Built-in methods of the original DOMNodeList
public function item( $index )
{
return $this->offsetGet( $index );
}
public function __get( $name )
{
switch( $name )
{
case "length":
return $this->count();
break;
}
return false;
}
}
// Load HTML/XML using our fancy DOMDocumentExtended class
$doc = new DOMDocumentExtended();
$doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');
// Remove doctype node
$doc->doctype->parentNode->removeChild( $doc->doctype );
// Slice out html node
$html = $doc->getElementsByTagName("html")->item(0);
$html->sliceOutNode();
// Slice out body node
$body = $doc->getElementsByTagName("body")->item(0);
$body->sliceOutNode();
// Pick your poison: XML or HTML output
echo htmlentities( $doc->saveXML( NULL, LIBXML_NOXMLDECL ) );
echo htmlentities( $doc->saveHTML() );
Answer 1 (score: 11)
saveHTML can output a subset of the document, which means we can traverse the body and output its child nodes one by one.
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<a href="http://google.com"><img src="http://google.com/img.jpeg" alt=""></a>
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here
// Let's traverse the body and output every child node
$bodyNode = $doc->getElementsByTagName('body')->item(0);
foreach ($bodyNode->childNodes as $childNode) {
    echo $doc->saveHTML($childNode);
}
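This may not be the most elegant solution, but it works. Alternatively, we could wrap all the child nodes in a container element (a div, for example) and output only that container (though the container's tags will then be included in the output).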
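If you need the result as a string rather than echoed piece by piece, the same traversal can be wrapped in a function. This is only a sketch: `innerHtml` is a hypothetical name, the return type hint assumes PHP 7.0 or later, and passing a node to saveHTML() needs PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper: concatenates the serialized children of $node,
// leaving out the tags of $node itself.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p><p>...to be one of those crowning achievements...</p>');

$body = $doc->getElementsByTagName('body')->item(0);
$inner = innerHtml($body);
echo $inner;
```

Because the document itself is never modified, this can be called on any element, not only body.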
Answer 2 (score: 2)
Here's how I did it:
- A quick helper function that gives you the HTML content of a specific DOM element
function nodeContent($n, $outer = false) {
    $d = new DOMDocument('1.0');
    $b = $d->importNode($n->cloneNode(true), true);
    $d->appendChild($b);
    $h = $d->saveHTML();
    // remove outer tags
    if (!$outer) {
        $h = substr($h, strpos($h, '>') + 1, -(strlen($n->nodeName) + 4));
    }
    return $h;
}
- Find the body node in your document and get its content
$xpath = new DOMXPath($doc);
$query = $xpath->query("//body")->item(0);
if ($query) {
    echo nodeContent($query);
}
Update 1:
Some additional info: since PHP 5.3.6, DOMDocument->saveHTML() accepts an optional DOMNode parameter, similar to DOMDocument->saveXML(). So you can do
$xpath = new DOMXPath($doc);
$query = $xpath->query("//body")->item(0);
echo $doc->saveHTML($query);
On earlier PHP versions, the helper function above will do the job.
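Answer 3 (score: 0)
tl;dr
Requires: PHP 5.4.0 and Libxml 2.6.0
$doc->loadHTML("<p>test</p>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
Explanation
http://php.net/manual/en/domdocument.loadhtml.php "Since PHP 5.4.0 and Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters."
LIBXML_HTML_NOIMPLIED
Sets the HTML_PARSE_NOIMPLIED flag, which turns off the automatic addition of implied html / body ... elements. (The PHP manual's constants page lists this one as needing Libxml 2.7.7 or later.)
LIBXML_HTML_NODEFDTD
Sets the HTML_PARSE_NODEFDTD flag, which prevents a default doctype from being added when none is found. (The PHP manual's constants page lists this one as needing Libxml 2.7.8 or later.)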
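To see what the two flags do, here is a small sketch that loads the same one-element snippet and prints the serialized result. The variable name `$output` is just for illustration.

```php
<?php
$doc = new DOMDocument();

// LIBXML_HTML_NOIMPLIED: do not add implied <html>/<body> elements.
// LIBXML_HTML_NODEFDTD: do not add a default DOCTYPE.
$doc->loadHTML('<p>test</p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$output = $doc->saveHTML();
echo $output;
```

One caveat, as far as I know: without the implied wrappers the parser still wants a single root, so a snippet with several top-level elements may come back restructured. Wrapping the input in one container element before loading is a common precaution.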
Answer 4 (score: -1)
You have two ways to accomplish this:
$content = substr($content, strpos($content, '<html><body>') + 12); // Remove Everything Before & Including The Opening HTML & Body Tags.
$content = substr($content, 0, -14); // Remove Everything After & Including The Closing HTML & Body Tags.
Or, better still, like this:
$dom->normalizeDocument();
$content = $dom->saveHTML();