Question

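Whenever we fetch user-entered content from a database or a similar source, after it has been edited, we may retrieve a portion that contains only an opening tag and no matching closing tag.

This can break the site's current layout.

Is there a client-side or server-side way to fix this?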

Answer 1

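Found a great answer for this one:

Use PHP 5 and the loadHTML() method of the DOMDocument object. This automatically parses badly formed HTML, and a subsequent call to saveXML() (or saveHTML(), as in the snippet below) outputs valid markup. The DOM functions are documented here:

http://www.php.net/dom

Usage: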

$doc = new DOMDocument();
$doc->loadHTML($yourText);
$yourText = $doc->saveHTML();

Answer 2

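You can use Tidy:

Tidy is a binding for the Tidy HTML clean and repair utility, which allows you not only to clean and otherwise manipulate HTML documents, but also to traverse the document tree.

or HTMLPurifier:

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.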

Answer 3

I have a solution for PHP:

<?php
    // close opened html tags
    function closetags ( $html )
    {
        #put all opened tags into an array
        preg_match_all ( "#<([a-z]+)( .*)?(?!/)>#iU", $html, $result );
        $openedtags = $result[1];

        #put all closed tags into an array
        preg_match_all ( "#</([a-z]+)>#iU", $html, $result );
        $closedtags = $result[1];
        $len_opened = count ( $openedtags );

        # all tags are closed
        if( count ( $closedtags ) == $len_opened )
        {
            return $html;
        }
        $openedtags = array_reverse ( $openedtags );

        # close tags
        for( $i = 0; $i < $len_opened; $i++ )
        {
            if ( !in_array ( $openedtags[$i], $closedtags ) )
            {
                $html .= "</" . $openedtags[$i] . ">";
            }
            else
            {
                unset ( $closedtags[array_search ( $openedtags[$i], $closedtags)] );
            }
        }
        return $html;
    }
    // close opened html tags
?>

You can use this function like this:

   <?php echo closetags("your content <p>test test"); ?>
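
For the client side, the same idea can be sketched in JavaScript with a stack of open tag names. This is only a rough illustration, not a port of the function above: the name closeOpenTags and the list of void elements are assumptions made for this example, it only appends missing closing tags at the end of the string, and it does not repair tags that are nested in the wrong order.

```javascript
// Hypothetical helper for illustration; not part of the PHP answer above.
// Void elements never take a closing tag, so they are skipped.
const VOID_TAGS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                           'input', 'link', 'meta', 'source', 'track', 'wbr']);

function closeOpenTags(html) {
    const stack = [];
    // Group 1 is "/" for a closing tag, group 2 is the tag name,
    // group 3 is "/" for a self-closed tag such as <br/>.
    const tagPattern = /<(\/?)([a-z][a-z0-9]*)\b[^>]*?(\/?)>/gi;
    let match;
    while ((match = tagPattern.exec(html)) !== null) {
        const name = match[2].toLowerCase();
        if (VOID_TAGS.has(name) || match[3] === '/') {
            continue;
        }
        if (match[1] === '/') {
            // Drop the nearest matching opener and anything opened after it.
            const index = stack.lastIndexOf(name);
            if (index !== -1) {
                stack.length = index;
            }
        } else {
            stack.push(name);
        }
    }
    // Append closing tags for whatever is still open, innermost first.
    let result = html;
    while (stack.length > 0) {
        result += '</' + stack.pop() + '>';
    }
    return result;
}

console.log(closeOpenTags('your content <p>test test'));
// prints: your content <p>test test</p>
```

Because the stack is unwound from the top, nested tags are closed innermost first, so '<div><p>a<br>b' becomes '<div><p>a<br>b</p></div>'.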

Answer 4

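For HTML fragments, and working from KJS's answer, I have had success with the following when the fragment has one root element:

$dom = new DOMDocument();
$dom->loadHTML($string);
$body = $dom->documentElement->firstChild->firstChild;
$string = $dom->saveHTML($body);

Without a root element this is possible (but it seems to wrap only the first text child node in p tags in text <p>para</p> text):

$dom = new DOMDocument();
$dom->loadHTML($string);
$bodyChildNodes = $dom->documentElement->firstChild->childNodes;
$string = '';
foreach ($bodyChildNodes as $node) {
    $string .= $dom->saveHTML($node);
}

Or better yet, from PHP >= 5.4 and libxml >= 2.7.8 (2.7.7 for LIBXML_HTML_NOIMPLIED), the libxml options LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD can be passed to loadHTML() so that no html/body wrapper or default doctype is added in the first place.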

Answer 5

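In addition to server-side tools like Tidy, you can also use the user's browser to do some of the cleanup for you. One of the really great things about innerHTML is that it applies the same on-the-fly repair to dynamic content as it does to HTML pages. This code works pretty well (with two caveats), and nothing actually gets written to the page: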

var divTemp = document.createElement('div');
divTemp.innerHTML = '<p id="myPara">these <i>tags aren\'t <strong> closed';
console.log(divTemp.innerHTML);

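Caveats:

Different browsers will return different strings. This isn't so bad, except in the case of IE, which returns capitalized tags and strips the quotes from tag attributes, so the result will not pass validation. The solution here is to do some simple cleanup on the server side. But at least the document will be properly structured XML.
I suspect that you may have to put in a delay before reading innerHTML, to give the browser a chance to digest the string, or you risk getting back exactly what was put in. I just tried on IE8 and it looks like the string gets parsed immediately, but I'm not so sure about IE6. It would probably be best to read innerHTML after a delay (or throw it into a setTimeout() to force it to the end of the queue).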
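
As a rough illustration of the "simple cleanup" mentioned in the first caveat, tag names can be lowercased with a regular expression. The function name lowercaseTagNames and the sample string are assumptions made for this sketch. It leaves attribute names and unquoted attribute values untouched, and it will misbehave if an attribute value contains a ">" character, so it is a starting point and not a complete normalizer.

```javascript
// Hypothetical cleanup helper for illustration: lowercases tag names only.
function lowercaseTagNames(html) {
    // Group 1: optional "/" of a closing tag. Group 2: the tag name.
    // Group 3: the rest of the tag (attributes or a trailing "/"), if any.
    const tagPattern = /<(\/?)([A-Za-z][A-Za-z0-9]*)((?:\s|\/)[^<>]*)?>/g;
    return html.replace(tagPattern, (whole, slash, name, rest) =>
        '<' + slash + name.toLowerCase() + (rest || '') + '>');
}

console.log(lowercaseTagNames('<P ID=myPara>these <I>tags</I></P>'));
// prints: <p ID=myPara>these <i>tags</i></p>
```

The same function runs unchanged in the browser or on a Node-based server; quoting the attribute values would still need a real parser or a tool such as Tidy.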

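I would recommend you take @Gordon's advice and use Tidy if you have access to it (it takes less work to implement); failing that, use innerHTML and write your own tidy function in PHP.

And though this isn't part of your question, since this is for a CMS, consider also using something like the YUI 2 Rich Text Editor for content like this. It is fairly easy to implement and somewhat easy to customize, the interface is very familiar to most users, and it emits perfectly valid code. There are several other off-the-shelf rich text editors out there, but YUI has the best license and is the most powerful one I've seen.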

Answer 6

A better PHP function for removing unopened/unclosed tags, from webmaster-glossar.de (me):

function closetag($html){
    $html_new = $html;
    preg_match_all ( "#<([a-z]+)( .*)?(?!/)>#iU", $html, $result1);
    preg_match_all ( "#</([a-z]+)>#iU", $html, $result2);
    $results_start = $result1[1];
    $results_end = $result2[1];
    foreach($results_start AS $startag){
        if(!in_array($startag, $results_end)){
            $html_new = str_replace('<'.$startag.'>', '', $html_new);
        }
    }
    foreach($results_end AS $endtag){
        if(!in_array($endtag, $results_start)){
            $html_new = str_replace('</'.$endtag.'>', '', $html_new);
        }
    }
    return $html_new;
}

Use this function like this:

closetag('i <b>love</b> my <strike>cat'); 
#output: i <b>love</b> my cat

closetag('i <b>love</b> my cat</strike>'); 
#output: i <b>love</b> my cat
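
Note that the str_replace() calls above match only bare tags such as <strike>, so an unclosed opener that carries attributes would be left in place. A possible variation, sketched here in JavaScript under the assumption that unmatched tags should simply be dropped, pairs tags with a stack and removes any tag that never finds a partner. The name stripUnbalancedTags and the void-element list are hypothetical and introduced only for this example.

```javascript
// Hypothetical sketch: remove tags that have no matching partner.
const VOID_ELEMENTS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
                               'input', 'link', 'meta', 'source', 'track', 'wbr']);

function stripUnbalancedTags(html) {
    const tagPattern = /<(\/?)([a-z][a-z0-9]*)\b[^>]*?(\/?)>/gi;
    const tokens = [];     // every tag found, with its position in the input
    const openStack = [];  // openers still waiting for a closing tag
    let match;
    while ((match = tagPattern.exec(html)) !== null) {
        const name = match[2].toLowerCase();
        const token = { start: match.index, end: match.index + match[0].length, keep: true };
        tokens.push(token);
        if (VOID_ELEMENTS.has(name) || match[3] === '/') {
            continue; // void and self-closed tags need no partner
        }
        if (match[1] !== '/') {
            openStack.push({ name: name, token: token });
            continue;
        }
        // Closing tag: look for the nearest opener with the same name.
        let i = openStack.length - 1;
        while (i >= 0 && openStack[i].name !== name) {
            i--;
        }
        if (i === -1) {
            token.keep = false; // closing tag without an opener
        } else {
            // Openers above the match never got their own closing tag.
            for (const orphan of openStack.splice(i).slice(1)) {
                orphan.token.keep = false;
            }
        }
    }
    for (const orphan of openStack) {
        orphan.token.keep = false; // openers that were never closed
    }
    // Rebuild the string, skipping the tags marked for removal.
    let result = '';
    let cursor = 0;
    for (const token of tokens) {
        result += html.slice(cursor, token.start);
        if (token.keep) {
            result += html.slice(token.start, token.end);
        }
        cursor = token.end;
    }
    return result + html.slice(cursor);
}

console.log(stripUnbalancedTags('i <b>love</b> my <strike>cat'));
// prints: i <b>love</b> my cat
```

With this approach an input such as '<span class="x">hi' comes back as 'hi', because the opener is identified by position in the string and not by its exact text.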

Answer 7

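Erik Arvidsson wrote a nice HTML SAX parser in 2004: http://erik.eae.net/archives/2004/11/20/12.18.31/

It keeps track of the open tags, so with a minimalistic SAX handler it is possible to insert closing tags at the correct position: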

function tidyHTML(html) {
    var output = '';
    HTMLParser(html, {
        comment: function(text) {
            // filter html comments
        },
        chars: function(text) {
            output += text;
        },
        start: function(tagName, attrs, unary) {
            output += '<' + tagName;
            for (var i = 0; i < attrs.length; i++) {
                output += ' ' + attrs[i].name + '=';
                if (attrs[i].value.indexOf('"') === -1) {
                    output += '"' + attrs[i].value + '"';
                } else if (attrs[i].value.indexOf('\'') === -1) {
                    output += '\'' + attrs[i].value + '\'';
                } else { // value contains " and ' so it cannot contain spaces
                    output += attrs[i].value;
                }
            }
            output += '>';
        },
        end: function(tagName) {
            output += '</' + tagName + '>';
        }
    });
    return output;
}

Answer 8

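I used the native DOMDocument method, but with a few improvements for safety.

Note that the other answers using DOMDocument do not account for HTML strands such as

This is a <em>HTML</em> strand

The above will actually result in

<p>This is a <em>HTML</em> strand

My solution is below: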

function closeDanglingTags($html) {
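    // strpos() returns 0 (falsy) for a match at the start, so compare against false explicitly
    if (strpos($html, '<') !== false || strpos($html, '>') !== false) {
        // There are definitely HTML tags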
        $wrapped = false;
        if (strpos(trim($html), '<') !== 0) {
            // The HTML starts with a text node. Wrap it in an element with an id to prevent the software wrapping it with a <p>
            //  that we know nothing about and cannot safely retrieve
            $html = cHE::getDivHtml($html, null, 'closedanglingtagswrapper');
            $wrapped = true;
        }
        $doc = new DOMDocument();
        $doc->encoding = 'utf-8';
        @$doc->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
        if ($doc->firstChild) {
            // Test whether the firstchild is definitely a DOMDocumentType
            if ($doc->firstChild instanceof DOMDocumentType) {
                // Remove the added doctype
                $doc->removeChild($doc->firstChild);
            }
        }
        if ($wrapped) {
            // The contents originally started with a text node and was wrapped in a div#closedanglingtagswrapper. Take the contents
            //  out of that div
            $node = $doc->getElementById('closedanglingtagswrapper');
            $children = $node->childNodes;  // The contents of the div. Equivalent to $('selector').children()
            $doc = new DOMDocument();   // Create a new document to add the contents to, equiv. to "var doc = $('<html></html>');"
            foreach ($children as $childnode) {
                $doc->appendChild($doc->importNode($childnode, true)); // E.g. doc.append()
            }
        }
        // Remove the added html,body tags
        return trim(str_replace(array('<html><body>', '</body></html>'), '', html_entity_decode($doc->saveHTML())));
    } else {
        return $html;
    }
}
