Question

This is my PHP function for removing all empty HTML tags from a string input:

/**
 * Remove the nested HTML empty tags from the string.
 *
 * @param $string String to remove tags
 * @param null $replaceTo Replace empty string with
 * @return mixed Cleaned string
 */
function crl_remove_empty_tags($string, $replaceTo = null)
{
    // Return if string not given or empty
    if (!is_string($string) || trim($string) == '') return $string;

    // Recursive empty HTML tags
    return preg_replace(
        '/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm',
        !is_string($replaceTo) ? '' : $replaceTo,
        $string
    );
}

My regex: /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm

I tested it with http://gskinner.com/RegExr/ and http://regexpal.com/ and it works fine. But when I try to run it, the server always returns this error:

Warning: preg_replace(): Unknown modifier '\'

I don't know what is wrong with '\'. Can someone please help me?

Answer 1

In PHP regular expressions, delimiters must be escaped if they appear inside the expression.

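In your case there are two unescaped / characters; replace each of them with \/. You also don't need the list of modifiers: g is not a PHP modifier (preg_replace() already replaces every match by default), and your pattern defines no literal characters.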

Before:

/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm

After:

/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/
//                                                                    ^       ^
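To show the fix in context, here is a minimal sketch of the question's function with the corrected pattern. The function name `crl_remove_empty_tags_fixed` is made up for this example, and the pattern is copied unchanged from the "After" line above.

```php
<?php
// Sketch only: the question's function with both delimiter slashes escaped
// and the modifier list dropped. The function name is hypothetical.
function crl_remove_empty_tags_fixed($string, $replaceTo = null)
{
    // Return the input untouched if it is not a non-empty string
    if (!is_string($string) || trim($string) === '') {
        return $string;
    }

    return preg_replace(
        '/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/',
        is_string($replaceTo) ? $replaceTo : '',
        $string
    );
}

// A single pass removes the empty <span> and leaves the paragraph alone.
echo crl_remove_empty_tags_fixed('<p>Hi</p><span class="x"></span>'); // prints: <p>Hi</p>
```

Another option is to pick a delimiter that does not occur in the pattern, such as `~` or `#`, so the slashes need no escaping at all.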

Answer 2

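This pattern can remove "empty tags" (that is, non-self-closing tags that contain nothing but whitespace, HTML comments, or other "empty tags"), even when those tags are nested, as in <span><span></span></span>. Tags inside HTML comments are not taken into account:

$pattern = <<<'EOD'
~
<
(?:
    !--[^-]*(?:-(?!->)[^-]*)*-->[^<]*(*SKIP)(*F) # skip comments
  |
    ( # group 1
        (\w++)     # tag name in group 2
        [^"'>]* #'"# all that is not a quote or a closing angle bracket
        (?: # quoted attributes
            "[^\\"]*(?:\\.[^\\"]*)*+" [^"'>]* #'"# double quote
          |
            '[^\\']*(?:\\.[^\\']*)*+' [^"'>]* #'"# single quote
        )*+
        >
        \s*
        (?:
            <!--[^-]*(?:-(?!->)[^-]*)*--> \s* # html comments
          |
            <(?1) \s*                         # recursion with the group 1
        )*+
        </\2> # closing tag
    ) # end of the group 1
)
~sxi
EOD;

$html = preg_replace($pattern, '', $html);

Limitations:

This approach will remove links to external JavaScript files:
<script src="myscript.js"></script>
The pattern may also remove parts of embedded JavaScript code if something like the following is found:
var myvar="<span></span>";

These limitations arise because a plain-text approach cannot tell HTML from JavaScript code. You can work around this by adding "script" tags to the pattern's skip list (in the same way as HTML comments), but then you need to describe the JavaScript content itself (strings, comments, literal patterns, and everything that is none of those three). That is not a trivial task, but it is possible.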
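The skip-list idea can be illustrated with a much smaller pattern. The sketch below is not the recursive pattern from this answer: it only removes directly empty elements in one pass, and it naively assumes that a script element ends at the first `</script>`. The function name is hypothetical.

```php
<?php
// Sketch only: a simplified pattern that shows the (*SKIP)(*F) idiom.
// Assumption: a <script> element ends at the first "</script>".
function remove_empty_tags_skipping(string $html): string
{
    $pattern = '~
        (?: <!--.*?-->                     # an HTML comment
          | <script\b[^>]*>.*?</script>    # a script element, with or without a body
        )(*SKIP)(*F)                       # consume it, then refuse to match it
      |
        <(\w+)[^>]*>\s*</\1>               # an empty element
    ~six';

    // preg_replace() returns null on failure; fall back to the input then.
    return preg_replace($pattern, '', $html) ?? $html;
}

echo remove_empty_tags_skipping('<!-- <b></b> --><script src="a.js"></script><i></i>x');
// prints: <!-- <b></b> --><script src="a.js"></script>x
```

Anything matched by the first alternative is stepped over as a whole, so the empty `<b>` inside the comment and the external script link both survive.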

Answer 3

Remove the empty elements... and then the next empty elements.

For example:

<p>Hello!
   <div class="foo"><p id="nobody">
   </p>
      </div>
 </p>

Result (ignoring leftover whitespace):

<p>Hello!</p>

PHP code:

/* $html holds the HTML content */
do {
    $tmp = $html;
    $html = preg_replace( '#<([^ >]+)[^>]*>([[:space:]]|&nbsp;)*</\1>#', '', $html );
} while ( $html !== $tmp );
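For reuse, the loop can be wrapped in a function. This is a minimal sketch with the same pattern; the function name `strip_empty_elements` is made up, and the input is written on one line so the output is easy to compare.

```php
<?php
// Sketch only: the do/while loop above wrapped in a hypothetical helper.
function strip_empty_elements(string $html): string
{
    do {
        $before = $html;
        // Remove elements that contain only whitespace or &nbsp;
        $html = preg_replace('#<([^ >]+)[^>]*>([[:space:]]|&nbsp;)*</\1>#', '', $before);
    } while ($html !== null && $html !== $before);

    // preg_replace() returns null on failure; keep the last good value then.
    return $html ?? $before;
}

echo strip_empty_elements('<p>Hello!<div class="foo"><p id="nobody"> </p></div></p>');
// prints: <p>Hello!</p>
```

The first pass removes the inner paragraph, the second removes the now empty `div`, and the third finds nothing and ends the loop.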

Answer 4

I'm not sure whether this is what you need, but I came across it today. You need PHP 5.4+!

// loadHTML() is an instance method; calling it statically fails on PHP 8
$oDOMHTML = new DOMDocument();
$oDOMHTML->loadHTML(
    $sYourHTMLString,
    LIBXML_HTML_NOIMPLIED |
    LIBXML_HTML_NODEFDTD |
    LIBXML_NOBLANKS |
    LIBXML_NOEMPTYTAG
);
$sYourHTMLStringWithoutEmptyTags = $oDOMHTML->saveXML();

Maybe this works for you.
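These libxml flags appear to affect only parsing and serialization, so an explicit removal pass over the DOM may still be needed. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the input is wrapped in a `<div>` so that a fragment has a single root, void elements such as `<img>` are kept, and the input is assumed to be ASCII or entity-encoded (loadHTML() guesses the charset otherwise).

```php
<?php
// Sketch only: remove elements that have no element children and no
// non-whitespace text, repeating until nothing more can be removed.
function remove_empty_elements_dom(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root  = $doc->documentElement;
    $xpath = new DOMXPath($doc);
    // Void elements are legitimately childless, so they are never removed.
    $keep = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'];

    do {
        $removed = false;
        // Leaf elements below the wrapper whose text is empty or whitespace only
        $leaves = $xpath->query('.//*[not(*) and normalize-space(.) = ""]', $root);
        foreach (iterator_to_array($leaves) as $node) {
            if (in_array(strtolower($node->nodeName), $keep, true)) {
                continue;
            }
            $node->parentNode->removeChild($node);
            $removed = true;
        }
    } while ($removed);

    // Serialize the wrapper's children, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo remove_empty_elements_dom('<p>Hello</p><div class="foo"><span> </span></div>');
```

Like the regex approaches in this thread, this sketch also removes an empty `<script src="..."></script>` element, so add `script` to the keep list if that matters.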

Answer 5

You can also solve this with recursion. Keep passing the HTML blob back into the function until no empty tags remain.

public static function removeHTMLTagsWithNoContent($htmlBlob) {
    $pattern = "/<[^\/>][^>]*><\/[^>]+>/";

    if (preg_match($pattern, $htmlBlob) == 1) {
        $htmlBlob = preg_replace($pattern, '', $htmlBlob);
        return self::removeHTMLTagsWithNoContent($htmlBlob);
    } else {
        return $htmlBlob;
    }
}

This checks for empty HTML tags and replaces them until the regex pattern no longer matches.
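The same idea can be written as a loop, which avoids the extra preg_match() call and the recursion. This is a minimal sketch, assuming the same pattern; the standalone function name is made up, and it relies on the optional count argument of preg_replace().

```php
<?php
// Sketch only: an iterative variant of the recursive method above.
function remove_tags_with_no_content(string $htmlBlob): string
{
    $pattern = '/<[^\/>][^>]*><\/[^>]+>/';

    do {
        // $count receives the number of replacements made in this pass
        $result = preg_replace($pattern, '', $htmlBlob, -1, $count);
        if ($result === null) {
            break; // regex failure: keep the last good value
        }
        $htmlBlob = $result;
    } while ($count > 0);

    return $htmlBlob;
}

echo remove_tags_with_no_content('<div><span></span></div>text'); // prints: text
```

Note that this pattern does not check that the closing tag matches the opening one, so an opening tag followed directly by any closing tag is removed.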

Answer 6

Here is another way to remove all empty tags. (It also removes the surrounding tags if they become empty because their children were empty):

/**
 * Remove empty tags.
 * This one will also remove <p><a href="/foo/bar.baz"><span></span></a></p> (empty paragraph with empty link)
 * But it will not alter <p><a href="/foo/bar.baz"><span>[CONTENT HERE]</span></a></p> (since the span has content)
 *
 * Be aware: <img ../> will be treated as an empty tag!
 */
do
{
    $len1 = mb_strlen($string);
    $string = preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $string);
    $len2 = mb_strlen($string);

} while ($len1 > 0 && $len2 > 0 && $len1 != $len2);

I have been using this to clean HTML from an external CMS, with positive results.

Regex to remove all empty HTML tags
