Question

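I am looking for a way to strip all empty HTML tag pairs from a string. While a regex for this purpose is relatively easy to find, I could not find one that works with PHP's preg_replace(). Here is one of the functions I tried (taken from https://stackoverflow.com/a/5573115/1784564):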

function strip_empty_tags($text) {
    // Match empty elements (attribute values may have angle brackets).
    $re = '%
        # Regex to match an empty HTML 4.01 Transitional element.
        <                    # Opening tag opening "<" delimiter.
        ((?!iframe)\w+)\b    # $1 Tag name.
        (?:                  # Non-capture group for optional attribute(s).
          \s+                # Attributes must be separated by whitespace.
          [\w\-.:]+          # Attribute name is required for attr=value pair.
          (?:                # Non-capture group for optional attribute value.
            \s*=\s*          # Name and value separated by "=" and optional ws.
            (?:              # Non-capture group for attrib value alternatives.
              "[^"]*"        # Double quoted string.
            | \'[^\']*\'     # Single quoted string.
            | [\w\-.:]+      # Non-quoted attrib value can be A-Z0-9-._:
            )                # End of attribute value alternatives.
          )?                 # Attribute value is optional.
        )*                   # Allow zero or more attribute=value pairs
        \s*                  # Whitespace is allowed before closing delimiter.
        >                    # Opening tag closing ">" delimiter.
        \s*                  # Content is zero or more whitespace.
        </\1\s*>             # Element closing tag.
        %x';
    while (preg_match($re, $text)) {
        // Recursively remove innermost empty elements.
        $text = preg_replace($re, '', $text);
    }

    return $text;
}

This is the HTML I have been testing with:

<strong class="a b">Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.l<br class="a  b" />fd<br class="a  b" /><br class="a  b" /></strong><strong class="a b"></strong><strong class="a b"><br class="a  b" /></strong><strong class="a b"></strong><br class="a  b" /><strong class="a b"><br class="a  b" /><br class="a  b" /></strong>

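Everything I have tried so far (more than 4 hours now) seems to strip some but not all of the tags, and it is driving me crazy. Any help would be greatly appreciated.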

Answer 1

You need a Unicode regex, because the "empty" tags in your sample are actually not empty:

$re = '~<(\w+)[^>]*>[\p{Z}\p{C}]*</\1>~u';

\p{Z} ... any kind of whitespace or invisible separator
\p{C} ... invisible control characters and unused code points

The u (PCRE_UTF8) modifier is used.
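
To illustrate why the Unicode classes matter, here is a minimal sketch that compares a plain \s* pattern with the pattern above. The sample string is hypothetical: it assumes the invisible character is a no-break space (U+00A0), which may not be what the original input contained. The "\u{...}" escape needs PHP 7.0 or later.

```php
<?php
// Hypothetical sample: the <strong> pair holds a no-break space (U+00A0),
// which is invisible when the markup is printed.
$html = "<p>text</p><strong class=\"a b\">\u{00A0}</strong>";

// Byte-oriented pattern: \s does not cover the two-byte UTF-8 sequence
// of U+00A0 (assuming the default C locale).
$ascii = '~<(\w+)[^>]*>\s*</\1>~';

// Pattern from the answer: \p{Z} includes U+00A0 once the u modifier is set.
$unicode = '~<(\w+)[^>]*>[\p{Z}\p{C}]*</\1>~u';

$byAscii   = preg_replace($ascii, '', $html);   // the <strong> pair survives
$byUnicode = preg_replace($unicode, '', $html); // the <strong> pair is removed

var_dump($byAscii === $html);            // bool(true)
var_dump($byUnicode === '<p>text</p>');  // bool(true)
```

If the input is not valid UTF-8, preg_replace() with the u modifier returns null, so it may be worth checking the return value before using it.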

To also treat <br> tags (including <br />) as empty content:

$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>)*</\1>~ui';

To also match tags that contain only space entities:

$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);)*</\1>~iu';

Modify it to suit your needs.
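
As a rough check of the <br> and entity alternatives, the sketch below runs the pattern on a small made-up string. The input and the variable names are assumptions for illustration only.

```php
<?php
// Pattern with the <br> and space-entity alternatives.
$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);)*</\1>~iu';

// Hypothetical input: one pair holding only <br />, &nbsp; and a space,
// one pair holding only the numeric entity &#160;, and one pair with text.
$html = '<p>keep</p><strong class="a b"><br class="a  b" />&nbsp; </strong><em>&#160;</em>';

$clean = preg_replace($re, '', $html);

var_dump($clean === '<p>keep</p>'); // bool(true)
```

Only the pairs whose content consists entirely of whitespace, <br> tags, or the listed entities are dropped; the paragraph with real text is left alone.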

Using a recursive regex (without the while loop):

$re = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);|(?R))*</\1>~iu';
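
The following sketch shows what the (?R) alternative buys: nested empty pairs disappear in a single preg_replace() call, whereas a non-recursive pattern needs repeated passes. The nested input and the variable names are invented for the example.

```php
<?php
// Non-recursive pattern: only matches pairs that are already empty.
$flat = '~<(\w+)[^>]*>[\p{Z}\p{C}]*</\1>~u';

// Recursive pattern from the answer: (?R) lets a pair contain other empty pairs.
$recursive = '~<(\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);|(?R))*</\1>~iu';

// Hypothetical input: a <div> that contains nothing but nested empty pairs.
$html = '<div><p><span></span> <b></b></p></div><p>kept</p>';

// One pass of the flat pattern removes only the innermost pairs.
$onePassFlat = preg_replace($flat, '', $html);

// One pass of the recursive pattern removes the whole empty subtree.
$onePassRecursive = preg_replace($recursive, '', $html);

var_dump($onePassFlat === '<div><p> </p></div><p>kept</p>'); // bool(true)
var_dump($onePassRecursive === '<p>kept</p>');               // bool(true)
```

On deeply nested input, recursion can run into PCRE's backtrack or recursion limits, in which case preg_replace() returns null and preg_last_error() reports the reason.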

Answer 2

Following up on my comment on Jonny 5's answer: I have added a couple of exempt tags to the recursive regex, because iframe and canvas are often legitimately empty.

$re = '~<((?!iframe|canvas)\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);|(?R))*</\1>~iu';
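
A small sketch of how the negative lookahead behaves, using a made-up input. Because the lookahead also applies inside the recursion, an element that wraps an empty iframe is not treated as empty either.

```php
<?php
// Recursive pattern with iframe and canvas exempted by a negative lookahead.
$re = '~<((?!iframe|canvas)\w+)[^>]*>(?>[\p{Z}\p{C}]|<br\b[^>]*>|&(?:(?:nb|thin|zwnb|e[nm])sp|zwnj|#xfeff|#xa0|#160|#65279);|(?R))*</\1>~iu';

// Hypothetical input: an empty iframe inside a div, an empty canvas,
// and a paragraph that contains only an empty <b> pair.
$html = '<div><iframe src="x.html"></iframe></div><canvas id="c"></canvas><p><b></b></p>';

$clean = preg_replace($re, '', $html);

// Only the <p><b></b></p> subtree is removed.
var_dump($clean === '<div><iframe src="x.html"></iframe></div><canvas id="c"></canvas>'); // bool(true)
```

Note that the lookahead tests a prefix, so any tag name that merely starts with "iframe" or "canvas" is exempted as well; adding \b after the alternation would restrict it to the exact names.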

Remove all empty HTML tag pairs in PHP
