DOMDocument removes closing tags inside script tags

Date: 2015-10-30 00:02:59

Tags: php dom domdocument

I have the following test.php file. When I run it, the closing </h1> tag inside the script is removed.

<?php

$doc = new DOMDocument();

$doc->loadHTML('<html>
    <head>
        <script>
            console.log("<h1>hello</h1>");
        </script>
    </head>
    <body>

    </body>
</html>');

echo $doc->saveHTML();

Here is the output when the file is executed:

PHP Warning:  DOMDocument::loadHTML(): Unexpected end tag : h1 in Entity, line: 4 in /home/ryan/NetBeansProjects/blog/test.php on line 14

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
    <head>
        <script>
            console.log("<h1>hello");
        </script>
    </head>
    <body>
    </body>
</html>

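So why is the tag removed? It is inside a JavaScript string, so shouldn't the parser ignore it?

3 answers:

Answer 0: (score: 2)

The only solution I can think of is to match the script tags with a regular expression, replace each one with a temporary placeholder such as <script id="myuniqueid"></script>, and swap the real scripts back in once the DOM manipulation is finished, like this: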

//  The dom doc
$doc = new DOMDocument();

//  The html
$html = '<html>
    <head>
        <script>
            console.log("<h1>hello</h1>");
        </script>
    </head>
    <body>

    </body>
</html>';

//  Pattern for scripts: match each whole <script>...</script> block,
//  including bodies that contain single quotes or newlines
$pattern = '/<script\b[^>]*>.*?<\/script>/is';
//  Get all scripts
preg_match_all($pattern, $html, $matches);

//  Only unique scripts
$matches = array_unique( $matches[0] );

//  Construct the arrays for replacement
//  (initialised so str_replace still works when there are no scripts)
$simple = [];
$complete = [];
foreach ( $matches as $match ) {
  //  The simple script
  $id = uniqid('script_');
  $uniqueScript = "<script id=\"$id\"></script>";
  $simple[] = $uniqueScript;
  //  The complete script
  $complete[] = $match;
}

//  Replace the scripts with the simple scripts
$html = str_replace($complete, $simple, $html);
//  load the html into the dom
$doc->loadHTML( $html);

//  Do the dom management here
//  TODO: Whatever you do with the dom

//  When finished
//  Get the html back
$html = $doc->saveHTML();
//  Replace the scripts back
$html = str_replace($simple, $complete, $html);
//Print the result
echo $html;

This solution prints the markup unchanged, with no DOM warnings.
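The same placeholder idea can be wrapped in two small functions so it is easier to reuse. This is only a sketch: stashScripts and restoreScripts are hypothetical names, and it uses a simple counter for the placeholder ids instead of uniqid, which is an assumption rather than part of the answer above.

```php
<?php
// Hypothetical helpers; the names and the counter-based ids are illustrative.

/**
 * Swap every <script>...</script> block for an empty placeholder script tag.
 * Fills $stash with placeholder => original block and returns the new HTML.
 */
function stashScripts(string $html, array &$stash): string
{
    $stash = [];
    $result = preg_replace_callback(
        '/<script\b[^>]*>.*?<\/script>/is',
        function (array $m) use (&$stash): string {
            $placeholder = '<script id="stash_' . count($stash) . '"></script>';
            $stash[$placeholder] = $m[0];
            return $placeholder;
        },
        $html
    );
    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

/** Put the original script blocks back in place of their placeholders. */
function restoreScripts(string $html, array $stash): string
{
    return str_replace(array_keys($stash), array_values($stash), $html);
}

$html = '<html><head><script>console.log("<h1>hello</h1>");</script></head><body></body></html>';

$stash = [];
$doc = new DOMDocument();
$doc->loadHTML(stashScripts($html, $stash));
// ... DOM manipulation would go here ...
echo restoreScripts($doc->saveHTML(), $stash);
```

As with the original code, the round trip relies on saveHTML serialising the empty placeholder tag exactly as it was written, and it breaks if the DOM manipulation itself touches the placeholder elements.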

Answer 1: (score: 2)

Pass LIBXML_SCHEMA_CREATE in the options argument of loadHTML. This resolves the problem.

<?php

$doc = new DOMDocument();

libxml_use_internal_errors(true);

$doc->loadHTML(
  '<html>
    <head>
        <script>
            console.log("<h1>hello</h1>");
        </script>
    </head>
    <body>

    </body>
</html>',
  LIBXML_HTML_NODEFDTD | LIBXML_SCHEMA_CREATE
);

echo $doc->saveHTML();
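Because libxml_use_internal_errors(true) hides the warning, the absence of a warning does not by itself show that the flag helped. The PHP manual documents LIBXML_SCHEMA_CREATE as a schema-validation option, so its effect on loadHTML is worth checking on your own PHP and libxml versions. A minimal sketch of such a check follows; probe is a hypothetical helper name, and no particular output is assumed.

```php
<?php
/**
 * Hypothetical helper: parse $html with the given libxml options and report
 * the collected parser messages plus whether "</h1>" survived serialisation.
 */
function probe(string $html, int $options): array
{
    // Collect parser errors in a buffer instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html, $options);

    $messages = array_map(
        fn (LibXMLError $e): string => trim($e->message),
        libxml_get_errors()
    );

    // Empty the buffer and restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [
        'messages' => $messages,
        'kept_closing_h1' => strpos((string) $doc->saveHTML(), '</h1>') !== false,
    ];
}

$html = '<html><head><script>console.log("<h1>hello</h1>");</script></head><body></body></html>';

// Compare the default options with the ones suggested in this answer.
var_dump(probe($html, 0));
var_dump(probe($html, LIBXML_HTML_NODEFDTD | LIBXML_SCHEMA_CREATE));
```

Comparing the two dumps shows both what the parser complained about and whether the closing tag is still present in the serialised output.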

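Answer 2: (score: 0)

Another option is to replace every </TAG> inside the script blocks with <\/TAG> before loading the document. In a JavaScript string literal \/ is just /, so the script behaves the same, and the HTML 4.01 specification (Appendix B.3.2) recommends this escape for script data.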

<?php
$html = <<<'EOD'
<!DOCTYPE html>
<html>
<head>
    <script>
    console.log("<h1>hello</h1>");
    var foo = '<a href="#"></a>';
    var bar = "<a href=\"#\"></a>";
    </script>
</head>
<body>
</body>
</html>
EOD;

// Collect every <script>...</script> block
preg_match_all('/<script\b[^>]*>.*?<\/script>/s', $html, $matches);
$matches = array_unique( $matches[0] );
$simple = [];
$complete = [];
if( !empty($matches) ) {
    foreach ( $matches as $matches__value ) {
        $before = $matches__value;
        $after = $matches__value;
        // Find the closing tags inside this block
        preg_match_all('/<\/[a-zA-Z][a-zA-Z0-9]*>/', $matches__value, $matches_inner);
        $matches_inner = array_unique( $matches_inner[0] );
        if( !empty($matches_inner) ) {
            foreach($matches_inner as $matches_inner__value) {
                // Leave the block's own closing tag alone
                if($matches_inner__value === '</script>') { continue; }
                // </h1> becomes <\/h1>
                $after = str_replace($matches_inner__value, str_replace('/','\/',$matches_inner__value), $after);
            }
            $simple[] = $after;
            $complete[] = $before;
        }
    }
    // Swap each original block for its escaped version
    $html = str_replace($complete, $simple, $html);
}

$DOMDocument = new \DOMDocument();
$DOMDocument->loadHTML($html);
$html = $DOMDocument->saveHTML();
echo $html;
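The escaping step above can also be written as a single function built on preg_replace_callback. This is a sketch under the same assumptions as the answer (script blocks found by regular expression, JavaScript content); escapeClosingTagsInScripts is a hypothetical name.

```php
<?php
/**
 * Hypothetical helper: inside each <script> block, rewrite closing tags such
 * as </h1> to <\/h1>, leaving the block's own </script> untouched.
 */
function escapeClosingTagsInScripts(string $html): string
{
    $result = preg_replace_callback(
        '/(<script\b[^>]*>)(.*?)(<\/script>)/is',
        function (array $m): string {
            // $m[2] is the script body. Only "</" followed by a letter is
            // rewritten; an already escaped "<\/" does not match again.
            // '<\\\\/' is the replacement text <\/ (backslash doubled for PCRE).
            $body = preg_replace('/<\/(?=[a-zA-Z])/', '<\\\\/', $m[2]);
            return $m[1] . ($body ?? $m[2]) . $m[3];
        },
        $html
    );
    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$html = '<html><head><script>console.log("<h1>hello</h1>");</script></head><body></body></html>';

$doc = new DOMDocument();
$doc->loadHTML(escapeClosingTagsInScripts($html));
echo $doc->saveHTML();
```

Unlike the placeholder approach, the escaped scripts stay in the document, so they remain visible to any DOM manipulation. The rewrite is only safe for script types where \/ means /, which holds for JavaScript but not for arbitrary template content stored in script tags.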