Question

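Google's pages suggest that you minify your HTML, that is, remove all unnecessary whitespace. CodeIgniter does have a feature for gzipping output, or it can be done via .htaccess. But I would still like to remove the unnecessary whitespace from the final HTML output.

I played around a bit with this piece of code, and it seems to work. It does produce HTML without excess whitespace, and it also strips other tab formatting.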

class Welcome extends CI_Controller 
{
    function _output($output)
    {
        echo preg_replace('!\s+!', ' ', $output);
    }

    function index(){
    ...
    }
}

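The problem is that there may be tags such as <pre>, <textarea>, etc. whose content contains significant whitespace, and this regular expression would strip it as well. So how do I use a regular expression to remove excess whitespace from the final HTML without affecting the whitespace or formatting inside these particular tags?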
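To make the problem concrete, here is a minimal sketch of what the naive pattern does to preformatted text. The sample markup is made up for illustration and is not from the original post.

```php
<?php
// Hypothetical sample markup: a <pre> block with meaningful indentation.
$html = "<div>\n    <p>Hello</p>\n</div>\n<pre>line 1\n    line 2</pre>";

// The naive approach from the question: collapse every whitespace run.
$minified = preg_replace('!\s+!', ' ', $html);

echo $minified, PHP_EOL;
// prints: <div> <p>Hello</p> </div> <pre>line 1 line 2</pre>
```

The line break and indentation inside <pre> are gone, which changes how the page renders.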

Thanks to @Alan Moore I got the answer; this worked for me:

echo preg_replace('#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#', ' ', $output);

ridgerunner did a very good job of analyzing this regular expression. I ended up using his solution. Cheers to ridgerunner.

Answer 1

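For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commenting it so that it can be read by mere mortals: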

function process_data_alan($text)
{
    $re = '%# Collapse ws everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          (?:           # Begin (unnecessary) group.
            (?:         # Zero or more of...
              [^<]++    # Either one or more non-"<"
            | <         # or a < starting a non-blacklist tag.
              (?!/?(?:textarea|pre)\b)
            )*+         # (This could be "unroll-the-loop"ified.)
          )             # End (unnecessary) group.
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %ix';
    $text = preg_replace($re, " ", $text);
    return $text;
}

I'm new around here, but I can see right away that Alan is quite good at regex. I would only add the following suggestions.

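There is an unnecessary group (marked in the comments above) which can be removed.
Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
There is an alternation group in the lookahead which is ripe for applying Friedl's "unrolling-the-loop" efficiency construct.
On a more serious note, this same alternation group, i.e. (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+, is susceptible to excessive PCRE recursion on large target strings. This can result in a stack overflow that causes the Apache/PHP executable to silently seg-fault and crash with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible because it has only 256KB of stack, whereas the *nix executables are typically built with 8MB of stack or more.) Philip Hazel (the author of the PCRE regex engine used in PHP) discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix that Philip shows in this document (applying a possessive plus to the first alternative), there will still be a lot of recursion if the HTML file is large and has a lot of non-blacklisted tags. For example, on my Win32 box (with an executable having a 256KB stack), the script blows up with a test file of only 60KB. Note also that PHP unfortunately does not follow the recommendations and sets the default recursion limit far too high, at 100000. (According to the PCRE docs, it should be set to a value equal to the stack size divided by 500.)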
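The "stack size divided by 500" rule of thumb can be sketched as a small calculation. The helper name pcre_recursion_limit_for() is hypothetical, and the stack sizes are the two figures quoted above; on PHP builds that bundle a newer PCRE2 the stack behaviour may differ, so treat the numbers as this answer's rule of thumb rather than a guarantee.

```php
<?php
// Rule of thumb quoted above: recursion limit = stack size in bytes / 500.
// pcre_recursion_limit_for() is a hypothetical helper name.
function pcre_recursion_limit_for(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

$win32Limit = pcre_recursion_limit_for(256 * 1024);      // 524
$nixLimit   = pcre_recursion_limit_for(8 * 1024 * 1024); // 16777

// Apply the value that matches the platform the script runs on.
ini_set('pcre.recursion_limit', (string) $nixLimit);
```

With the limit in place, PCRE gives up and preg_replace() returns null instead of overflowing the stack.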

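Here is an improved version which is faster than the original, handles larger input, and fails gracefully with a message if the input string is too large to handle: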

// Set PCRE recursion limit to sane value = STACKSIZE / 500
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
ini_set("pcre.recursion_limit", "16777");  // 8MB stack. *nix
function process_data_jmr1($text)
{
    $re = '%# Collapse whitespace everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          [^<]*+        # Either zero or more non-"<" {normal*}
          (?:           # Begin {(special normal*)*} construct
            <           # or a < starting a non-blacklist tag.
            (?!/?(?:textarea|pre|script)\b)
            [^<]*+      # more non-"<" {normal*}
          )*+           # Finish "unrolling-the-loop"
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre|script)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %Six';
    $text = preg_replace($re, " ", $text);
    if ($text === null) exit("PCRE Error! File too big.\n");
    return $text;
}

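P.S. I'm intimately familiar with this PHP/Apache seg-fault problem, as I was involved in helping the Drupal community while they were wrestling with it. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also ran into this with the BBCode parser in the FluxBB forum software project.

Hope this helps.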

Answer 2

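I implemented the answer from @ridgerunner in two projects, and ended up hitting some severe slowdowns (10-30 second request times) in staging for one of them. I found that I had to set both pcre.recursion_limit and pcre.backtrack_limit quite low for it to work at all, but even then it would give up after about 2 seconds of processing and fail (preg_replace returns null on error).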
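A failure like that can at least be made non-fatal by serving the unminified page when PCRE gives up. The following is a minimal sketch, not part of any answer here: collapse_or_passthrough() is a hypothetical helper, and the simple pattern stands in for whichever minifying regex is actually used.

```php
<?php
// collapse_or_passthrough() is a hypothetical helper name.
// The simple pattern below is a stand-in for the real minifying regex.
function collapse_or_passthrough(string $html): string
{
    $result = preg_replace('/\s{2,}/', ' ', $html);

    // preg_replace() returns null when PCRE hits one of its limits.
    if ($result === null) {
        error_log('HTML minification skipped, PCRE error code ' . preg_last_error());
        return $html; // serve the page unminified rather than failing
    }

    return $result;
}

echo collapse_or_passthrough("<p>a   b</p>"), PHP_EOL;
```

This does not cure the slowdown itself; it only keeps a hit limit from turning into a blank or broken response.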

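Since then I've replaced it with this solution (with an easier-to-grasp regex), which is inspired by the outputfilter.trimwhitespace function from Smarty 2. It does no backtracking or recursion, and it works every time (instead of failing catastrophically once in a blue moon):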

function filterHtml($input) {
    // Remove HTML comments, but not SSI
    $input = preg_replace('/<!--[^#](.*?)-->/s', '', $input);

    // The content inside these tags will be spared:
    $doNotCompressTags = ['script', 'pre', 'textarea'];
    $matches = [];

    foreach ($doNotCompressTags as $tag) {
        $regex = "!<{$tag}[^>]*?>.*?</{$tag}>!is";

        // It is assumed that this placeholder could not appear organically in your
        // output. If it can, you may have an XSS problem.
        $placeholder = "@@<'-placeholder-$tag'>@@";

        // Replace all the tags (including their content) with a placeholder, and keep their contents for later.
        $input = preg_replace_callback(
            $regex,
            function ($match) use ($tag, &$matches, $placeholder) {
                $matches[$tag][] = $match[0];
                return $placeholder;
            },
            $input
        );
    }

    // Remove whitespace (spaces, newlines and tabs)
    $input = trim(preg_replace('/[ \n\t]+/m', ' ', $input));

    // Iterate the blocks we replaced with placeholders beforehand, and replace the placeholders
    // with the original content.
    foreach ($matches as $tag => $blocks) {
        $placeholder = "@@<'-placeholder-$tag'>@@";
        $placeholderLength = strlen($placeholder);
        $position = 0;

        foreach ($blocks as $block) {
            $position = strpos($input, $placeholder, $position);
            if ($position === false) {
                throw new \RuntimeException("Found too many placeholders of type $tag in input string");
            }
            $input = substr_replace($input, $block, $position, $placeholderLength);
        }
    }

    return $input;
}

Answer 3

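Sorry for not posting this as a comment; I lack the reputation ;)

I want to urge everybody not to implement this kind of regex without checking for performance penalties. Shopware implemented the first regex (from Alan/ridgerunner) for its HTML minification and "blew up" every shop with larger pages.

Where possible, a combined solution (regex plus some other logic) is usually faster and more maintainable for complex problems (unless you are Damian Conway).

I also want to mention that most minifiers can break your code (both JavaScript and HTML) when a script block itself emits another script block via document.write.

Attached is my solution (an optimized version of user2677898's snippet). I simplified the code and ran some tests. Under PHP 7.2 my version was about 30% faster for my particular test case. Under PHP 7.3 and 7.4 the old version gained a lot of speed and is now only about 10% slower than mine. Besides that, my version is still easier to maintain because the code is less complex.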

function filterHtml($content)
{
    // List of untouchable HTML-tags.
    $unchanged = 'script|pre|textarea';

    // It is assumed that this placeholder could not appear organically in your
    // output. If it can, you may have an XSS problem.
    $placeholder = "@@<'-pLaChLdR-'>@@";

    // Some helper variables.
    $unchangedBlocks  = [];
    $unchangedRegex   = "!<($unchanged)[^>]*?>.*?</\\1>!is";
    $placeholderRegex = "!$placeholder!";

    // Replace all the tags (including their content) with a placeholder, and keep their contents for later.
    $content = preg_replace_callback(
        $unchangedRegex,
        function ($match) use (&$unchangedBlocks, $placeholder) {
            array_push($unchangedBlocks, $match[0]);
            return $placeholder;
        },
        $content
    );

    // Remove HTML comments, but not SSI
    $content = preg_replace('/<!--[^#](.*?)-->/s', '', $content);

    // Remove whitespace (spaces, newlines and tabs)
    $content = trim(preg_replace('/[ \n\t]{2,}|[\n\t]/m', ' ', $content));

    // Replace the placeholders with the original content.
    $content = preg_replace_callback(
        $placeholderRegex,
        function ($match) use (&$unchangedBlocks) {
            // I am a paranoid.
            if (count($unchangedBlocks) == 0) {
                throw new \RuntimeException("Found too many placeholders in input string");
            }
            return array_shift($unchangedBlocks);
        },
        $content
    );

    return $content;
}

Minifying the final HTML output using regular expressions with CodeIgniter
