I want to get the entire contents of the body tag using DOMDocument.
I used the following code:
$dom = new DOMDocument;
/*** load the html into the object ***/
$dom->loadHTML($html);
/*** get the body element by its tag name and read its nodeValue ***/
$tables = $dom->getElementsByTagName('body')->item(0)->nodeValue;
This gives me only the text. I want the full body content, markup included.
Answer 0: (score: 12)
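You can pass the body DOMElement to DOMDocument::saveHTML(), which accepts an optional node argument as of PHP 5.3.6 (DOMDocument::saveHTMLFile() takes only a filename and always writes the whole document), for example: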
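<?php
$doc = new DOMDocument;
// Fetch and parse the remote page; malformed markup triggers warnings.
$doc->loadHTMLFile('http://stackoverflow.com');
$body = $doc->getElementsByTagName('body');
if ( $body && 0 < $body->length ) {
    $body = $body->item(0);
    // Serialize only the <body> element, markup included.
    echo $doc->saveHTML($body);
}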
prints
Warning: DOMDocument::loadHTMLFile(): Unexpected end tag : p in http://stackoverflow.com, line: 2843 [...]
<body class="home-page">
<noscript><div id="noscript-padding"></div></noscript>
<div id="notify-container"></div>
<div id="overlay-header"></div>
<div id="custom-header"></div>
<div class="container">
<div id="header">
<div id="portalLink">
[...]
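Note that saveHTML($body) includes the <body> tag itself, as the output above shows. If only the markup inside the body is wanted, one option is to serialize each child node in turn. The following is a minimal sketch under that assumption; the helper name bodyInnerHtml() is hypothetical and not part of the original answer. It also uses libxml_use_internal_errors() to collect parse warnings like the one shown above instead of printing them.

```php
<?php
// Hypothetical helper (not from the original answer): returns the markup
// inside <body>, without the <body> tag itself.
function bodyInnerHtml(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize every child of <body> and concatenate the results.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Example usage with an inline sample document.
echo bodyInnerHtml('<html><body><p>Hello <b>world</b></p></body></html>');
```

Restoring the previous libxml error setting keeps the helper from changing error handling for the rest of the script.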
Answer 1: (score: 4)
It is safer to use the PHP tidy extension, which can repair invalid XHTML structure and extract only the body:
$tidy = new tidy();
$htmlBody = $tidy->repairString($html, array(
'output-xhtml' => true,
'show-body-only' => true,
), 'utf8');
Then load the extracted body into a DOMDocument:
$xml = new DOMDocument();
$xml->loadHTML($htmlBody);
Answer 2: (score: 0)
$dom = new DOMDocument;
$dom->loadHTML($html);
// ... change, replace ...
// ... mock, traverse ...
// Assumes <body> is the last child of the <html> root element.
$body = $dom->documentElement->lastChild;
echo $dom->saveHTML($body);
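Taking documentElement->lastChild relies on the body being the last child of the root element, which may not hold for every parsed document. As a defensive variant, the sketch below walks the root's children backwards and checks the node name before using the node. The function name findBodyElement() and the sample markup are assumptions made for illustration only.

```php
<?php
// Hypothetical helper: find the <body> element among the children of the
// root element instead of assuming it is the last child.
function findBodyElement(DOMDocument $dom): ?DOMElement
{
    $root = $dom->documentElement;
    if ($root === null) {
        return null;
    }
    // Walk backwards from the last child, since <body> normally comes last.
    for ($node = $root->lastChild; $node !== null; $node = $node->previousSibling) {
        if ($node instanceof DOMElement && strtolower($node->nodeName) === 'body') {
            return $node;
        }
    }
    return null;
}

// Example usage with a small sample document.
$html = '<html><head><title>Demo</title></head><body><p>Content</p></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);

$body = findBodyElement($dom);
if ($body !== null) {
    echo $dom->saveHTML($body);
}
```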