Using DOMDocument

Date: 2018-01-29 07:23:46

Tags: php html parsing dom

Let's assume my $html looks like this:

<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type="text/javascript" src="/gui/default/tinymcecontent.js"></script>
    <script type="text/javascript" src="/includes/js/video-js/video.min.js"></script>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
    <script type="text/javascript">document.createElement("video");document.createElement("audio");document.createElement("track");</script>
    <script type"text/javascript" src="/includes/js/video-js/video.js"></script/>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
</head>
<body style="font-family: arial;font-size: 12px;">
    <p> </p>
    <table width="100%">        
    </table>
</body>
</html>

When I try to parse only the elements inside the body tag with:

$dom = new DOMDocument();

libxml_use_internal_errors(true);
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
libxml_use_internal_errors(false);

$full_dom = $dom->getElementsByTagName('body')->item(0);

the result of $dom->saveHTML($full_dom) (shown here JSON-escaped) is:

<body>\n<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>\n<p>\u00a0<\/p>\n<table width=\"100%\"><\/table>\n<\/body>

Where does the element

<p>\/&gt;<link rel=\"stylesheet\" href=\"\/includes\/js\/video-js\/video-js.css\"><\/p>

come from? Everything else is fine, but this one element has been moved from the head tag into the body tag.

1 answer:

Answer 0 (score: 1)

It comes from this line:

<script type"text/javascript" src="/includes/js/video-js/video.js"></script/>

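It is malformed: the type attribute is missing its "=" and the closing tag is written as </script/>. The parser closes the script at </script, and the leftover "/>" becomes text, which forces the body open and drags the following <link> in with it. The line should be: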

<script type="text/javascript" src="/includes/js/video-js/video.js"></script>

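You should check the errors right after $dom->loadHTML() to see what happened. Do this before calling libxml_use_internal_errors(false), because disabling internal error handling also clears the buffered errors: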

foreach (libxml_get_errors() as $error) {
    print_r($error);
}
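
To put the pieces together, here is a minimal sketch of the load-then-inspect sequence. The shortened $html sample is a hypothetical reduction of the markup in the question, and the exact messages reported depend on the installed libxml2 version, so treat it as an illustration rather than a definitive recipe:

```php
<?php
// Hypothetical reduced input that keeps the malformed <script> line from the question.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
    <script type"text/javascript" src="/includes/js/video-js/video.js"></script/>
    <link rel="stylesheet" href="/includes/js/video-js/video-js.css" />
</head>
<body>
    <p>text</p>
</body>
</html>
HTML;

$dom = new DOMDocument();

// Buffer libxml errors instead of emitting PHP warnings,
// remembering the previous setting so it can be restored.
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);

// Read the buffered errors BEFORE restoring the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Each entry is a LibXMLError with line, column and message properties.
foreach ($errors as $error) {
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
```

The reported line numbers refer to the input string, which makes it easier to locate the offending tag in the original markup.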