
Question

I'm having a problem parsing HTML with PHP's DOMDocument.

The HTML I'm parsing contains the following script tag:

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

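This snippet has two problems:

1) The HTML inside the buttonWithCountTemplate var is not escaped. DOMDocument handles this correctly, escaping the characters while parsing. Not a problem.

2) Near the end, there is an img tag with an unescaped closing tag:

<img src="$iconImg" />

The /> makes DOMDocument think the script has finished but is missing its closing tag. If you extract the script with getElementsByTagName, you get a tag that is closed at this img tag, and the rest appears as text in the HTML.

My objective is to remove all the scripts in this page, so if I call removeChild() on this tag, the tag is removed, but the following part shows up as text when the page is rendered: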

</div><div class="sCountBox">$count</div></a></div>', } </script>

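Fixing the HTML is not a solution, because I'm developing a generic parser that needs to handle all kinds of HTML.

My question is whether I should do any sanitization before feeding the HTML to DOMDocument, whether there is an option I can enable on DOMDocument to avoid triggering this issue, or even whether I can strip all the tags before loading the HTML.

Any ideas?

Edit

After some research, I found the real problem with the DOMDocument parser. Consider the following HTML: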

<div>
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>

Using the following PHP code to remove the script tags (based on Gholizadeh's answer):

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
libxml_use_internal_errors(true);
$dom->loadHTML(file_get_contents('js.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

while ($nodes = $dom->getElementsByTagName("script")) {
    if ($nodes->length == 0) break;
    $script = $nodes->item(0);
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$final = $dom->saveHTML();
echo $final;

The result is:

<div>  <p>'; // I should not appear on the result </p></div>

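The problem is that the first div tag is not closed, and DOMDocument seems to treat the div tag inside the JS string as HTML rather than as part of a plain JS string.

What can I do to solve this? Remember that modifying the HTML is not an option, since I'm developing a generic parser.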

Answer 1

I tested the code below on an HTML file like this:

<p>some text 1</p>
<img src="http://www.example.com/images/some_image_1.jpg">
<p>some text 2</p>
<p>some text 3</p>
<img src="http://www.example.com/images/some_image_2.jpg">

<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="$onClick"><div class="sBtn">$text<img src="$iconImg" /></div><div class="sCountBox">$count</div></a></div>',
    }
</script>

<p>some text 4</p>
<p>some text 5</p>
<img src="http://www.example.com/images/some_image_3.jpg">

The PHP code is:

<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTML(file_get_contents('script.html'), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
//@$dom->loadHTMLFile('script.html'); //fix tags if not exist

// getElementsByTagName() returns a live list, so copy it to an array
// before removing nodes; otherwise some scripts are skipped.
$nodes = iterator_to_array($dom->getElementsByTagName("script"));

foreach ($nodes as $script) {
    $script->parentNode->removeChild($script);
}

//return $dom->saveHTML();
$dom->saveHTMLFile('script.html');

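It works for the given example. I think you should use the options I passed when loading the HTML.

Update, based on the latest edit to the question:

You can't really parse [X]HTML with regular expressions (see the linked post for more information), but if your only purpose is to remove script tags, and you can be sure there is no </script> tag inside them as a string, you can use this regex: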

$html = mb_convert_encoding(file_get_contents('script2.html'), 'HTML-ENTITIES', 'UTF-8');
$new_html = preg_replace('/<script(.*?)>(.*?)<\/script>/si', '', $html);
file_put_contents('script-result.html', $new_html);

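Frankly, the problem is that you may not have standard HTML. I think it is better to try one of the other libraries in the linked list.

Otherwise, I guess you should write a special parser that removes script tags and handles the single and double quotes inside them.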
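One way such a quote-aware remover could look is sketched below. The function name stripScriptsQuoteAware is hypothetical (it is not part of PHP or of any library), and the scanner is deliberately simple: it tracks single- and double-quoted strings with backslash escapes only, so JS comments containing quotes, template literals, regex literals and a closing tag written as "</script >" are not handled. Note also that a browser ends a script element at the first </script>, even inside a string, so whether this behavior is wanted depends on the input.

```php
<?php
// Hypothetical helper: removes <script> elements while ignoring any
// "</script>" that appears inside a quoted JavaScript string.
function stripScriptsQuoteAware(string $html): string
{
    $out = '';
    $pos = 0;
    $len = strlen($html);

    // Find each opening <script ...> tag, case-insensitively.
    while (preg_match('/<script\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE, $pos)) {
        $start = $m[0][1];
        $out  .= substr($html, $pos, $start - $pos); // keep the text before the script
        $i     = $start + strlen($m[0][0]);
        $quote = null; // quote character we are currently inside, if any

        // Scan the script body until a closing tag outside of any string.
        while ($i < $len) {
            $ch = $html[$i];
            if ($quote !== null) {
                if ($ch === '\\') {
                    $i += 2;       // skip the escaped character
                    continue;
                }
                if ($ch === $quote) {
                    $quote = null; // the string ends here
                }
            } elseif ($ch === '"' || $ch === "'") {
                $quote = $ch;      // a string starts here
            } elseif (strncasecmp(substr($html, $i, 9), '</script>', 9) === 0) {
                $i += 9;           // consume the closing tag
                break;
            }
            $i++;
        }
        $pos = min($i, $len);      // an unterminated script runs to the end of input
    }

    return $out . substr($html, $pos);
}

$in = "<div><script type=\"text/javascript\">var test = '</script>';</script><p>kept</p></div>";
echo stripScriptsQuoteAware($in);
```

The cleaned string can then be passed to DOMDocument::loadHTML() as usual.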

Answer 2

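I'm offering a different approach to your problem:

My objective is to remove all the scripts in this page

Then you can remove them with the preg_replace_callback function first, and parse the HTML into a DOM afterwards. Here is a working demo: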

$htmlWithScript = "<html><body><div>something></div><script type=\"text/javascript\">
var showShareBarUI_params_e81 =
{
    buttonWithCountTemplate: '<div class=\"sBtnWrap\"><a href=\"#\" onclick=\"\$onClick\"><div class=\"sBtn\">\$text<img src=\"\$iconImg\" /></div><div class=\"sCountBox\">\$count</div></a></div>',
}
</script></body></html>";



$htmlWithoutScript = preg_replace_callback('~<script.*>.*</script>~Uis', function($matches){
return '';
}, $htmlWithScript);
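The second half of that approach, handing the cleaned string to DOMDocument, might look like the sketch below. The sample input is an assumption made for illustration, and plain preg_replace with a lazy quantifier (.*?) is swapped in for preg_replace_callback with the U modifier, since the callback only ever returns an empty string.

```php
<?php
// Assumed sample input; any HTML string would do.
$html = '<html><body><div>something</div>'
      . '<script type="text/javascript">var test = \'</div>\';</script>'
      . '<p>after</p></body></html>';

// Step 1: drop every <script>...</script> block before DOMDocument sees it.
$clean = preg_replace('~<script\b[^>]*>.*?</script>~is', '', $html);

// Step 2: parse what is left; libxml warnings are collected, not printed.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($clean);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$result = $dom->saveHTML();
echo $result;
```

Because the script is gone before parsing starts, the </div> inside the JS string never reaches the HTML parser.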

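Edit

But how do you do this without summoning Cthulhu?

Nice comment, but I don't know what you are asking :) If it is about loading the HTML, you can use file_get_contents() to load it.

If you don't understand how it removes the tags: preg_replace_callback lets you search for regexp matches and transform them, in this case remove them (return '';). The regexp looks for an opening <script> tag with any attributes (.*) and anything between it and the closing </script> tag.

Modifiers:

U -> ungreedy (shortest match)

i -> case-insensitive (<SCRIPT> is matched as well)

s -> the dot (.) matches every character, including newlines, so line breaks do not break the match

I hope this clarifies things a bit...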
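To make the effect of each modifier concrete, the sketch below runs the same pattern four times, dropping one modifier at a time. The input string is an assumed example, and preg_replace is used instead of preg_replace_callback for brevity.

```php
<?php
// Assumed sample: one upper-case script and one script spanning several lines.
$html = "<p>a</p><SCRIPT>one</SCRIPT><p>b</p><script>\ntwo\n</script><p>c</p>";

// U + i + s together: both scripts are removed, each on its own.
$all = preg_replace('~<script.*>.*</script>~Uis', '', $html);

// Without U the quantifiers are greedy: one match runs from the first
// opening tag to the last closing tag and takes <p>b</p> with it.
$greedy = preg_replace('~<script.*>.*</script>~is', '', $html);

// Without i the upper-case <SCRIPT> element is left in place.
$caseSensitive = preg_replace('~<script.*>.*</script>~Us', '', $html);

// Without s the dot cannot cross a newline, so the multi-line script stays.
$noDotAll = preg_replace('~<script.*>.*</script>~Ui', '', $html);

var_dump($all, $greedy, $caseSensitive, $noDotAll);
```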

Answer 3

Have you tried setting libxml to use internal errors?

$use_errors = libxml_use_internal_errors(true);
// your parsing code here
libxml_clear_errors();
libxml_use_internal_errors($use_errors);

It might allow DOMDocument to continue parsing (maybe).
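A slightly fuller sketch of the same idea is shown below, with the collected errors inspected before they are cleared. The sample markup is an assumption, and which errors are reported (if any) depends on the installed libxml version. This setting only changes how parse problems are reported; it does not change how the parser interprets the content of a script element.

```php
<?php
$html = '<div><unknowntag>text</div>'; // assumed sample input, deliberately sloppy

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```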

Answer 4

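Parsing an HTML document is mostly about its content, not its scripts. Using those scripts without knowing their behavior and origin can be dangerous and risky.

So, when it comes to the HTML content, you can leave the scripts out with this approach (which I already pointed to in the comments): How to combine PHP's DOMDocument with a JavaScript template

Specifically for your example: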

<?php
$html = <<<END
<!DOCTYPE html>
<html><body><h1>Hey now</h1>
<script type="text/javascript">
    var showShareBarUI_params_e81 =
    {
        buttonWithCountTemplate: '<div class="sBtnWrap"><a href="#" onclick="onClick"><div class="sBtn">text<img src="iconImg" /></div><div class="sCountBox">count</div></a></div>'
    }
</script>
</body></html>
END;

$dom = new DOMDocument();
$dom->preserveWhiteSpace = true; // needs to be before loading, to have any effect
$dom->loadXML($html);
    while (($r = $dom->getElementsByTagName("script")) && $r->length) {
        $r->item(0)->parentNode->removeChild($r->item(0));
    }
$dom->formatOutput = false;
print $dom->saveHTML();

//Outputs
//<!DOCTYPE html><html><head></head><body><h1>Hey now</h1></body></html>

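You could also try stripping the script tags with a regular expression before loading into DOMDocument, or look into other HTML parsing libraries. In the end you have to accept that even a perfect expression will break in some cases, and that the DOMDocument parser is not as good as a real browser engine. It all comes down to parsing and finding the best workaround.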
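As one illustration of how a reasonable-looking expression can break, the sketch below uses an assumed input in which an opening script tag sits inside an HTML comment. The regex cannot know about comments, so it starts matching there and removes visible content.

```php
<?php
// Assumed sample: a commented-out <script> tag followed by real content.
$html = '<!-- <script> disabled --><p>visible</p><script>var x = 1;</script><p>after</p>';

// The match starts at the <script> inside the comment and ends at the
// first </script>, so <p>visible</p> is removed along with the script
// and the comment is left unclosed.
$stripped = preg_replace('~<script\b[^>]*>.*?</script>~is', '', $html);

echo $stripped;
```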

PHP Simple HTML DOM Parser example:

http://simplehtmldom.sourceforge.net/manual.htm

require_once 'libs/simplehtmldom_1_5/simple_html_dom.php';
$html = <<<END
<div> <!-- Offending div without closing tag -->
<script type="text/javascript">
       var test = '</div>';
       // I should not appear on the result
</script>
END;

$dom = str_get_html($html);
echo $dom;

//outputs with no error or warnings
//<div> <!-- Offending div without closing tag --><script type="text/javascript">var test = '</div>';// I should not appear on the result  </script>

PHP DOMDocument: errors while parsing unescaped strings
