Question

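Using PHP regular expressions, how can I remove HTML tags (both opening and closing), including ones with attributes such as <hr class="myclass" />, without removing non-HTML tags such as <dog> or <dog class="cat">?

The non-HTML tags are dynamic and cannot be hard-coded.

Input: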

<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>

The output should be:

<> <<> <dog> <123> <" !>

I am considering HTML Purifier, but first I need to know whether this can be done with regular expressions.

HTML tag reference: http://www.quackit.com/html/tags/

Thanks in advance =)

Answer 1

To match (and remove) only the start and end tags of HTML 4.01 elements, the regular expression in this tested PHP function does a pretty good job:

function strip_HTML_tags($text)
{ // Strips HTML 4.01 start and end tags. Preserves contents.
    return preg_replace('%
        # Match an opening or closing HTML 4.01 tag.
        </?                  # Tag opening "<" delimiter.
        (?:                  # Group for HTML 4.01 tags.
          ABBR|ACRONYM|ADDRESS|APPLET|AREA|A|BASE|BASEFONT|BDO|BIG|
          BLOCKQUOTE|BODY|BR|BUTTON|B|CAPTION|CENTER|CITE|CODE|COL|
          COLGROUP|DD|DEL|DFN|DIR|DIV|DL|DT|EM|FIELDSET|FONT|FORM|
          FRAME|FRAMESET|H[1-6]|HEAD|HR|HTML|IFRAME|IMG|INPUT|INS|
          ISINDEX|I|KBD|LABEL|LEGEND|LI|LINK|MAP|MENU|META|NOFRAMES|
          NOSCRIPT|OBJECT|OL|OPTGROUP|OPTION|PARAM|PRE|P|Q|SAMP|
          SCRIPT|SELECT|SMALL|SPAN|STRIKE|STRONG|STYLE|SUB|SUP|S|
          TABLE|TD|TBODY|TEXTAREA|TFOOT|TH|THEAD|TITLE|TR|TT|U|UL|VAR
        )\b                  # End group of tag name alternative.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        /?                   # Tag may be empty (self-closing "/>" sequence).
        >                    # Opening tag closing ">" delimiter.
        | <!--.*?-->         # Or a (non-SGML compliant) HTML comment.
        | <!DOCTYPE[^>]*>    # Or a DOCTYPE.
        %six', '', $text);
}

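CAVEATS: Does not remove <? ... ?> script blocks. Any start or end tags that appear inside such structures will be removed. Does not correctly parse general SGML-compliant comments. Does not handle short tags.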

EDIT: Added alternatives that match a DOCTYPE and (non-SGML-strict) HTML comments. The function now correctly handles the test data in the OP.
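As a quick check against the question's test data, here is a minimal sketch. The function strip_some_tags() is hypothetical: it is a cut-down variant of the function above that knows only two tag names (B and HR) to keep the example short, so it is not the full HTML 4.01 list.

```php
<?php
// Sketch only: strip_some_tags() is a hypothetical, cut-down variant that
// knows just two tag names (B and HR) instead of the full HTML 4.01 list.
function strip_some_tags($text)
{
    return preg_replace('%
        </?(?:B|HR)\b                 # Start or end tag with a listed name.
        (?:\s+[\w\-.:]+               # Optional attribute name,
          (?:\s*=\s*(?:"[^"]*"|\'[^\']*\'|[\w\-.:]+))?
        )*                            # with an optional value.
        \s*/?>                        # Optional "/" and the closing ">".
        | <!--.*?-->                  # Or a comment.
        | <!DOCTYPE[^>]*>             # Or a DOCTYPE.
        %six', '', $text);
}

$input = '<b><> <<> <dog> <123> <" !> <!--...--> <!doctype> <hr class="myclass" /> </b>';

// The removed tags leave their separating spaces behind, hence the rtrim().
echo rtrim(strip_some_tags($input)), "\n";
// prints: <> <<> <dog> <123> <" !>
```

The unknown constructs survive because none of them starts with a listed tag name, which is the whole point of the whitelist-style alternation.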

EDIT2: The previous version was missing the 's' (single-line) modifier. Also added short tags to the list of caveats.
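To see why the 's' modifier matters for the comment alternative, here is a small sketch with a hypothetical two-line comment. Without 's', the dot does not match a newline, so a comment that spans lines is left untouched.

```php
<?php
// Hypothetical input: a comment that spans two lines.
$text = "before <!-- a comment\nspanning two lines --> after";

// Without "s", "." stops at the newline, so the comment is left in place.
$withoutS = preg_replace('%<!--.*?-->%', '', $text);

// With "s", "." also matches newlines and the whole comment is removed.
$withS = preg_replace('%<!--.*?-->%s', '', $text);

var_dump($withoutS === $text);          // bool(true)
var_dump($withS === 'before  after');   // bool(true)
```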

Answer 2

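Consider using HTML Purifier with the HTML.Proprietary option enabled, then use the HTML.Allowed option to explicitly whitelist the specific tags and attributes you want to keep.

Keep in mind that parsing HTML with regular expressions can easily invoke the wrath of Zalgo. Do not mock Zalgo.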

Answer 3

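Use the function called strip_tags(). Note that it removes everything that looks like a tag, so by default your "custom" tags such as <dog> are removed as well. To keep them, pass the tags you do not want removed as the second argument.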
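A short sketch of how strip_tags() treats an unknown tag, with and without the second argument. The input string is made up for illustration.

```php
<?php
// Hypothetical input mixing a real HTML tag with a custom one.
$input = '<b>bold</b> and <dog>custom</dog>';

// With no second argument every tag is removed, known or not.
echo strip_tags($input), "\n";           // bold and custom

// Listing a tag in the second argument keeps it in the output.
echo strip_tags($input, '<dog>'), "\n";  // bold and <dog>custom</dog>
```

Because the tags to keep must be listed up front, this only fits the question if the set of custom tags can be collected before the call.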

Answer 4

Another working alternative, from Dhon:

<?php
$exemption_array = array('<a href"http://www.autopartswarehouse.com/search/?searchType=global&N=0&Ntt=A1327630">');
function strip_HTML_tags_withExemptions( $str , $arrayExemption = array() ){
    // Note: $arrayExemption holds literal strings, written as tags, that are replaced with a space before the regex runs, e.g. <a href"http://www.autopartswarehouse.com/search/?searchType=global&N=0&Ntt=A1327630">
    foreach( $arrayExemption as $k => $exemptions )
        $str = str_replace($exemptions, " " , $str);
    $str = preg_replace("/<\/?(!DOCTYPE|a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdo|big|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|command|datalist|dd|del|details|dfn|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h\d|head|header|hgroup|hr|html|i|iframe|img|input|ins|keygen|kbd|label|legend|li|link|map|mark|menu|meta|meter|nav|noframes|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|tt|u|ul|var|video|wbr|xmp)((\s+\w+(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>|<!--.*?-->/is" , " ", $str);
    $str = preg_replace('/\s\s+/', ' ', $str );
    $str = preg_replace('/[\.]+/', '.', $str );
    return $str;
}
?>
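The order of operations in that function can be sketched on a small input. This is an illustration under assumptions: the URL is a placeholder, and $tagPattern is a cut-down pattern that only knows the "a" element rather than the full list above. It shows why a literal string is useful for a malformed tag that the tag pattern cannot match by itself.

```php
<?php
// Hypothetical input: the opening tag is malformed (no "=" after href).
$text = 'See <a href"http://example.com/">this</a> page.';
$literals = array('<a href"http://example.com/">');

// Cut-down tag pattern that only knows the "a" element.
$tagPattern = '/<\/?a(\s+\w+(\s*=\s*(?:".*?"|\'.*?\'|[^\'">\s]+))?)*\s*\/?>/i';

// The malformed tag does not match the pattern on its own.
var_dump(preg_match($tagPattern, $literals[0]));   // int(0)

$text = str_replace($literals, ' ', $text);        // 1. remove literal strings
$text = preg_replace($tagPattern, ' ', $text);     // 2. remove well-formed tags
$text = preg_replace('/\s\s+/', ' ', $text);       // 3. collapse whitespace
echo $text, "\n";                                  // See this page.
```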
