Question

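I have a PHP script that generates HTML emails. To keep the size down and avoid breaching Google's 102 kB limit, I am trying to squeeze as many unnecessary characters out of the code as possible.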

I currently use Emogrifier to inline the CSS and then TinyMinify to minify the result.

The output still has spaces between the properties and values of the inline styles (e.g. style="color: #ffffff; font-weight: 16px").

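I have come up with the following regex to remove the extra spaces, but it also affects the actual content (e.g. this &amp; that becomes this &amp;that):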

$out = preg_replace("/(;|:)\s([a-zA-Z0-9#])/", "$1$2", $newsletter);

How can I modify this regex so that it is restricted to inline styles, or is there a better approach?

Answer 1

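There is no bulletproof way to avoid matching the payload (style="" can occur anywhere in the text) while also avoiding actual CSS values (such as content: 'a: b'). In addition, consider: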

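Shortening values: red is shorter than #f00, which is shorter than #ff0000
Removing leading and trailing cruft, such as spaces and semicolons
Redesigning the HTML: for example, using <ins> and <strong> is much shorter than using inline CSS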
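
As an illustration of the first point, here is a minimal sketch of colour shortening applied to a single style string. The function name shortenColors and the one-entry name table are assumptions made for this example, not part of any library, and the patterns would also touch hex-like text inside url() or content values, so treat it as a starting point only.

```php
<?php
// Hypothetical helper (not from the answer): shorten colour values in one style string.
function shortenColors(string $style): string
{
    // Collapse #aabbcc to #abc when each pair repeats; \b leaves 8-digit codes alone.
    $style = preg_replace('/#([\da-f])\1([\da-f])\2([\da-f])\3\b/i', '#$1$2$3', $style);

    // Names that are shorter than their 3-digit hex form (illustrative subset only).
    $named = ['#f00' => 'red'];

    return preg_replace_callback(
        '/#[\da-f]{3}\b/i',
        function (array $m) use ($named): string {
            return $named[strtolower($m[0])] ?? $m[0];
        },
        $style
    );
}

echo shortenColors('color: #FF0000; background: #ffffff; border-color: #ff6633'), "\n";
// color: red; background: #fff; border-color: #f63
```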

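One approach is to first match all inline style HTML attributes and then operate only on their contents, but you will have to test for yourself how well it works: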

$out= preg_replace_callback
( '/( style=")([^"]*)("[ >])/'  // Find all appropriate HTML attributes
, function( $aMatch ) {  // Per match
    // Kill any amount of any kind of spaces after colon or semicolon only
    $sInner= preg_replace
    ( '/([;:])\\s*([a-zA-Z0-9#])/'  // Escaping backslash in PHP string context
    , '$1$2'
    , $aMatch[2]  // Second sub match
    );

    // Kill any amount of leading and trailing semicolons and/or spaces
    $sInner= preg_replace
    ( array( '/^\\s*;*\\s*/', '/\\s*;*\\s*$/' )
    , ''
    , $sInner
    );

    return $aMatch[1]. $sInner. $aMatch[3];  // New HTML attribute
  }
, $newsletter
);
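
One way to do that testing is to wrap the same two steps in a function and run it on a small sample. The sketch below assumes a hypothetical function name, minifyInlineStyles, and a made-up sample string; the patterns are the ones used above.

```php
<?php
// Hypothetical wrapper around the two steps above, so they can be tested in isolation.
function minifyInlineStyles(string $html): string
{
    return preg_replace_callback(
        '/( style=")([^"]*)("[ >])/',  // Double-quoted style attributes only
        function (array $m): string {
            // Remove whitespace after a colon or semicolon
            $inner = preg_replace('/([;:])\s*([a-zA-Z0-9#])/', '$1$2', $m[2]);
            // Remove leading and trailing semicolons and whitespace
            $inner = preg_replace(['/^\s*;*\s*/', '/\s*;*\s*$/'], '', $inner);
            return $m[1] . $inner . $m[3];
        },
        $html
    );
}

$sample = '<p style="color: #ffffff; font-weight: bold; ">this &amp; that</p>';
echo minifyInlineStyles($sample), "\n";
// <p style="color:#ffffff;font-weight:bold">this &amp; that</p>
```

The entity in the text content is left alone because only the attribute value is rewritten. Single-quoted attributes are not matched by this pattern and pass through unchanged.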

Answer 2

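You have not provided sample input for us to work with, but you mention that you are processing HTML. That should ring alarm bells: using regex as the direct solution is ill-advised. When you intend to process valid HTML, you should use a DOM parser to isolate the style attributes.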

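Why not use regex to isolate the inline style declarations? In short: regex is "unaware". It does not know whether it is inside or outside a tag (my demo includes a contrived "Monkeywrench" to expose this vulnerability). In addition, a DOM parser correctly handles the different kinds of quoting. A regex can be written to match and verify balanced quotes, but doing so properly adds considerable bloat and harms the readability and maintainability of the script.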
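
To make the "unaware" point concrete, here is a minimal sketch that counts style attributes both ways. The sample markup and the simplified regex are assumptions chosen for illustration; only standard preg_match_all and DOMDocument calls are used.

```php
<?php
// Sample markup: one real style attribute, plus text that merely looks like one.
$html = '<span style="color: red;">Tip: write style="padding: 3px;" on the tag</span>';

// A regex reports two "attributes" because it has no notion of tags versus text.
preg_match_all('/\bstyle="([^"]*)"/', $html, $matches);
echo count($matches[1]), "\n"; // 2

// A DOM parser only reports the attribute that really belongs to an element.
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$real = [];
foreach ($dom->getElementsByTagName('*') as $node) {
    if ($node->hasAttribute('style')) {
        $real[] = $node->getAttribute('style');
    }
}
echo count($real), "\n"; // 1
```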

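In my demo, I show how simply and accurately you can strip the whitespace after colons, semicolons, and commas once the genuine inline style declarations have been isolated. I have gone a step further (since this page mentions colour hex code compression) to show how regex can be used to reduce some six-character hex codes to three characters.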

Code:

$html = <<<HTML
<div style='font-family: "Times New Roman", Georgia, serif; background-color: #ffffff; '>
  <p>Some text 
    <span class="ohyeah" style="font-weight: bold; color: #ff6633 !important; border: solid 1px grey;">
      Monkeywrench: style="padding: 3px;"
    </span>
    &
    <strong style="text-decoration: underline; ">Underlined</strong>
  </p>
  <h1 style="margin: 1px 2px 3px 4px;">Heading</h1>
  <span style="background-image:     url('images/not_a_hexcode_ffffff.png');    ">Text</span>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('*') as $node) {
    $style = $node->getAttribute('style');
    if ($style) {
        // \b keeps the hex pattern from truncating 8-digit codes such as #aabbccdd
        $patterns = ['~[:;,]\K\s+~', '~#\K([\da-f])\1([\da-f])\2([\da-f])\3\b~i'];
        $replaces = ['', '\1\2\3'];
        $node->setAttribute('style', preg_replace($patterns, $replaces, $style));
    }
}
$html = $dom->saveHTML();
echo $html;

Output:

<div style='font-family:"Times New Roman",Georgia,serif;background-color:#fff;'>
  <p>Some text 
    <span class="ohyeah" style="font-weight:bold;color:#f63 !important;border:solid 1px grey;">
      Monkeywrench: style="padding: 3px;"
    </span>
    &amp;
    <strong style="text-decoration:underline;">Underlined</strong>
  </p>
  <h1 style="margin:1px 2px 3px 4px;">Heading</h1>
  <span style="background-image:url('images/not_a_hexcode_ffffff.png');">Text</span>
</div>

The snippet above uses \K in the patterns to avoid lookarounds and excessive capture groups.
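
For comparison, here is a small sketch of the same whitespace removal written three ways: with \K, with a capture group, and with a lookbehind. The sample style string and variable names are assumptions for illustration; all three variants should give the same result.

```php
<?php
$style = 'font-family: "Times New Roman", Georgia, serif; color: #fff; ';

// With \K: the delimiter is matched but dropped from the reported match,
// so only the whitespace is replaced and no backreference is needed.
$withK = preg_replace('~[:;,]\K\s+~', '', $style);

// With a capture group: the delimiter has to be put back in the replacement.
$withGroup = preg_replace('~([:;,])\s+~', '$1', $style);

// With a lookbehind: same result, slightly more pattern noise.
$withLookbehind = preg_replace('~(?<=[:;,])\s+~', '', $style);

var_dump($withK === $withGroup && $withK === $withLookbehind); // bool(true)
echo $withK, "\n";
// font-family:"Times New Roman",Georgia,serif;color:#fff;
```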

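I did not write a pattern to remove the space before !important, because I have read some (not recent) posts reporting that some browsers misbehave when that space is missing.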

How do I remove the spaces in inline styles?
