Extract URLs from @font-face by searching within @font-face for replacement

Question

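I have a web service that rewrites URLs in CSS files so that they can be served via a CDN.

The CSS files can contain URLs to images or fonts.

I currently have the following regex to match ALL URLs within a CSS file: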

(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))

However, I now want to introduce support for custom fonts and need to target the URLs within @font-face:

@font-face {
  font-family: 'FontAwesome';
  src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
  src: url("fonts/fontawesome-webfont.eot?#iefix&v=4.0.3") format("embedded-opentype"), url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"), url("fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"), url("fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular") format("svg");
  font-weight: normal;
  font-style: normal;
}

Then I came up with the following:

@font-face\s*\{.*(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))\s*\}

The problem is that this matches everything, not just the URLs inside. I thought I could use a lookbehind like so:

(?<=@font-face\s*\{.*)(url\(\s*([\'\"]?+))((?!(https?\:|data\:|\.\.\/|\/))\S+)((\2)\s*\))(?<=-\s*\})

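Unfortunately, PCRE (which PHP uses) does not support variable-length repetition within a lookbehind, so I am stuck.

I don't want to identify fonts by their extension, because some fonts have an .svg extension, which can conflict with images that have an .svg extension.

In addition, I would also like to modify my original regex to match all other URLs that are NOT within an @font-face: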

.someclass {
  background: url('images/someimage.png') no-repeat;
}

Since I can't use lookbehinds, how can I extract the URLs that are within an @font-face and those that are not?

Answer 1

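Disclaimer: You are probably better off using a library, because this is harder than you think. I also want to start this answer with how to match URLs that are not within @font-face {}. I also assume/define that the braces {} are balanced within @font-face {}.

Note: I am going to use "~" as the delimiter instead of "/", which saves me from escaping slashes later in my expressions. Also note that I will be posting online demos from regex101.com, where I use the g modifier. You should remove the g modifier and just use preg_match_all().

Let's use some regex fu!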

Part 1: Matching URLs that are not within @font-face {}

1.1 Matching @font-face {}

Oh yes, this might sound "weird", but you will see why later :)
We will need some recursive regex here:

@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1

demo
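As a quick check of the recursive pattern above, here is a minimal sketch in PHP. The sample CSS strings are made up for illustration, and the pattern is the same one written on a single line without the x modifier.

```php
<?php
// Hypothetical sample stylesheet, used only for illustration.
$css = "@font-face { font-family: 'A'; src: url(fonts/a.woff); }\n"
     . ".box { background: url(images/b.png); }\n";

// Group 1 matches a balanced {...} block; (?1) re-runs group 1 for nested braces.
$fontFaceRegex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~';

// $matches[0] holds the full @font-face blocks, $matches[1] only the {...} part.
preg_match_all($fontFaceRegex, $css, $matches);
print_r($matches[0]);

// Nested braces (not valid inside a real @font-face, shown only to exercise the recursion).
preg_match($fontFaceRegex, '@font-face { a { b } c } .x { d }', $nested);
print_r($nested[0]);
```

Only the @font-face block is matched in the first string; the .box rule is left alone.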

1.2 Skipping @font-face {}

We will use (*SKIP)(*FAIL) right after the previous regex, which makes the engine skip the matched block. See this answer for an idea of how it works.

demo
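To see the skipping behaviour in isolation, here is a small sketch. The right-hand branch is deliberately simplified to an unquoted url(...) token, which is an assumption made for this example and not the final pattern; the sample CSS is also made up.

```php
<?php
// Hypothetical sample stylesheet, used only for illustration.
$css = "@font-face { src: url(fonts/a.woff); }\n"
     . ".box { background: url(images/b.png); }\n";

// Left branch: consume a whole @font-face block, then force a failure.
// (*SKIP) makes the next attempt start after the consumed block, so nothing
// inside it can be matched by the right branch.
// Right branch: the thing we actually want (simplified here).
$regex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})(*SKIP)(*FAIL)|url\([^)]*\)~';

preg_match_all($regex, $css, $m);
print_r($m[0]);
```

The url() inside the @font-face block does not appear in the result, only the one in .box does.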

1.3 Matching url()

We will use something like this:

url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
(?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
\2               # Match what was matched in group 2
\s*              # Match optionally some whitespaces
\)               # Match )

Note that I am using \2 because I have appended this to the previous regex, which already contains group 1.
Here is another use of ("|')(?:[^\\]|\\.)*?\1.

demo

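1.4 Matching the value inside url()

You might have guessed that we need some lookaround fu. The problem is with the lookbehind, since it needs to be fixed-length. I have a workaround for that: let me introduce you to the \K escape sequence. It resets the beginning of the match to the current position in the token list (more info).
Well, let's drop \K somewhere in our expression and use a lookahead. Our final regex will be: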

@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
(*SKIP)(*FAIL)   # Skip it
|                # Or
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|'|)           # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url's with http, https or ftp)
(?:[^\\]|\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \2            # Match what was matched in group 2
   \s*           # Match optionally some whitespaces
   \)            # Match )
)

demo
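The effect of \K combined with a lookahead is easier to see on a reduced pattern. The sketch below is a simplification (no @font-face skipping, no negative rules, no escaped characters), and the CDN host cdn.example.com is a placeholder chosen for this example.

```php
<?php
// Hypothetical sample stylesheet, used only for illustration.
$css = ".a { background: url('images/a.png'); }\n"
     . ".b { background: url( images/b.png ); }\n";

// Everything before \K is required but dropped from the reported match;
// the lookahead checks the closing quote and ")" without consuming them.
$regex = '~url\(\s*(["\']?)\K[^"\'\s)]+(?=\1\s*\))~';

preg_match_all($regex, $css, $m);
$values = $m[0];
print_r($values);

// Because the match is only the value, a replacement leaves "url(", the quotes and ")" intact.
$rewritten = preg_replace($regex, 'https://cdn.example.com/$0', $css);
echo $rewritten;
```

This is why no variable-length lookbehind is needed: the prefix is matched normally and then discarded by \K.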

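1.5 Using the pattern in PHP

We need to escape a few things, such as quotes and backslashes (\\\\ = \), and use the right function and the right modifiers: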

$regex = '~
@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
(*SKIP)(*FAIL)   # Skip it
|                # Or
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url\'s with http, https or ftp)
(?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \2            # Match what was matched in group 2
   \s*           # Match optionally some whitespaces
   \)            # Match )
)
~xs';

$input = file_get_contents($css_file);
preg_match_all($regex, $input, $m);
echo '<pre>'. print_r($m[0], true) . '</pre>';

demo
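Since the goal in the question is rewriting rather than just listing, the same pattern can be passed to preg_replace_callback(). The following is only a sketch: the function name rewriteNonFontFaceUrls and the CDN base URL are hypothetical, and the pattern is the one above written on one line.

```php
<?php
// Hypothetical helper: prefixes every relative URL outside @font-face with $cdnBase.
function rewriteNonFontFaceUrls(string $css, string $cdnBase): string
{
    $regex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})(*SKIP)(*FAIL)'
           . '|url\s*\(\s*("|\'|)\K(?!["\']?(?:https?://|ftp://))(?:[^\\\\]|\\\\.)*?(?=\2\s*\))~s';

    // $m[0] is only the URL value, because \K dropped the "url(" prefix
    // and the closing quote and ")" are inside a lookahead.
    return preg_replace_callback($regex, function (array $m) use ($cdnBase) {
        return $cdnBase . $m[0];
    }, $css);
}

// Hypothetical sample stylesheet, used only for illustration.
$css = "@font-face { src: url(\"fonts/a.woff\"); }\n"
     . ".box { background: url('images/b.png'); }\n"
     . ".ext { background: url(http://example.org/c.png); }\n";

$out = rewriteNonFontFaceUrls($css, 'https://cdn.example.com/');
echo $out;
```

Only the .box URL is rewritten: the @font-face URL is skipped, and the absolute URL is rejected by the negative lookahead.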

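Part 2: Matching URLs that are within @font-face {}

2.1 A different approach

I want to do this part with two regexes, because matching URLs inside @font-face {} while keeping track of the state of the braces {} in a recursive regex would be a pain.

Since we already have the pieces we need, we only have to apply them in some code:

Match all @font-face {} instances
Loop through these and match all the url()s

2.2 Putting it into code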

$results = array(); // Just an empty array;
$fontface_regex = '~
@font-face\s*    # Match @font-face and some spaces
(                # Start group 1
   \{            # Match {
   (?:           # A non-capturing group
      [^{}]+     # Match anything except {} one or more times
      |          # Or
      (?1)       # Recurse/rerun the expression of group 1
   )*            # Repeat 0 or more times
   \}            # Match }
)                # End group 1
~xs';

$url_regex = '~
url\s*\(         # Match url, optionally some whitespaces and then (
\s*              # Match optionally some whitespaces
("|\'|)          # It seems that the quotes are optional according to http://www.w3.org/TR/CSS2/syndata.html#uri
\K               # Reset the match
(?!["\']?(?:https?://|ftp://))  # Put your negative-rules here (do not match url\'s with http, https or ftp)
(?:[^\\\\]|\\\\.)*?  # Match anything except a backslash or backslash and a character zero or more times ungreedy
(?=              # Lookahead
   \1            # Match what was matched in group 1
   \s*           # Match optionally some whitespaces
   \)            # Match )
)
~xs';

$input = file_get_contents($css_file);

preg_match_all($fontface_regex, $input, $fontfaces); // Get all font-face instances
if(isset($fontfaces[0])){ // If there is a match then
    foreach($fontfaces[0] as $fontface){ // Foreach instance
        preg_match_all($url_regex, $fontface, $r); // Let's match the url's
        if(isset($r[0])){ // If there is a hit
            $results[] = $r[0]; // Then add it to the results array
        }
    }
}
echo '<pre>'. print_r($results, true) . '</pre>'; // Show the results

demo
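The same two-step idea can also be used to rewrite the URLs in place instead of collecting them. This is a sketch under assumptions: the function name rewriteFontFaceUrls and the CDN base URL are hypothetical, and both patterns are the ones above written on one line.

```php
<?php
// Hypothetical helper: prefixes relative URLs inside @font-face blocks with $cdnBase.
function rewriteFontFaceUrls(string $css, string $cdnBase): string
{
    $fontFaceRegex = '~@font-face\s*(\{(?:[^{}]+|(?1))*\})~';
    $urlRegex = '~url\s*\(\s*("|\'|)\K(?!["\']?(?:https?://|ftp://))(?:[^\\\\]|\\\\.)*?(?=\1\s*\))~s';

    // Outer callback: receives each whole @font-face block.
    return preg_replace_callback($fontFaceRegex, function (array $block) use ($urlRegex, $cdnBase) {
        // Inner callback: receives each URL value found inside that block.
        return preg_replace_callback($urlRegex, function (array $url) use ($cdnBase) {
            return $cdnBase . $url[0];
        }, $block[0]);
    }, $css);
}

// Hypothetical sample stylesheet, used only for illustration.
$css = "@font-face { src: url(\"fonts/a.woff\"); }\n"
     . ".box { background: url('images/b.png'); }\n";

$out = rewriteFontFaceUrls($css, 'https://cdn.example.com/');
echo $out;
```

Here only the font URL changes; the .box rule is outside every @font-face block and is never seen by the inner pattern.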


Answer 2

You can use this:

$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<quoted_content>
        (["']) (?>[^"'\\]++ | \\{2} | \\. | (?!\g{-1})["'] )*+ \g{-1}
    )
    (?<comment> /\* .*? \*/ )
    (?<url_skip> (?: https?: | data: ) [^"'\s)}]*+ )
    (?<other_content>
        (?> [^u}/"']++ | \g<quoted_content> | \g<comment>
          | \Bu | u(?!rl\s*+\() | /(?!\*) 
          | \g<url_start> \g<url_skip> ["']?+
        )++
    )
    (?<anchor> \G(?<!^) ["']?+ | @font-face \s*+ { )
    (?<url_start> url\( \s*+ ["']?+ )
)

\g<comment> (*SKIP)(*FAIL) |

\g<anchor> \g<other_content>?+ \g<url_start> \K [./]*+ 

( [^"'\s)}]*+ )    # url
~xs
LOD;

$result = preg_replace($pattern, 'http://cdn.test.com/fonts/$8', $data);
print_r($result);

Test string

$data = <<<'LOD'
@font-face {
  font-family: 'FontAwesome';
  src: url("fonts/fontawesome-webfont.eot?v=4.0.3");
  src: url(fonts/fontawesome-webfont.eot?#iefix&v=4.0.3) format("embedded-opentype"),
     /*url("fonts/fontawesome-webfont.woff?v=4.0.3") format("woff"),*/
       url("http://domain.com/fonts/fontawesome-webfont.ttf?v=4.0.3") format("truetype"),
       url('fonts/fontawesome-webfont.svg?v=4.0.3#fontawesomeregular') format("svg");
  font-weight: normal;
  font-style: normal;
}
/*
@font-face {
  font-family: 'Font1';
  src: url("fonts/font1.eot");
} */
@font-face {
  font-family: 'Fon\'t2';
  src: url("fonts/font2.eot");
}
@font-face {
  font-family: 'Font3';
  src: url("../fonts/font3.eot");
}
LOD;

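Main idea:

To be more readable, the pattern is divided into named subpatterns. The (?(DEFINE)...) section doesn't match anything; it is only a definition section.

The main trick of this pattern is the use of the \G anchor, which means: the start of the string or contiguous to a previous match. I added a negative lookbehind (?<!^) to exclude the first part of this definition.

The <anchor> named subpattern is the most important one, because it allows a match only if @font-face { is found or immediately after the end of a URL (this is why you can see a ["']?+ there).

<other_content> represents everything that is not a URL section, but it also matches the URL sections that must be skipped (URLs that begin with "http:" or "data:"). The important detail of this subpattern is that it can't match the closing curly brace of the @font-face.

The mission of <url_start> is only to match url(".

\K removes from the match result all of the substring that has been matched before it.

([^"'\s)}]*+) matches the URL (together with the leading ./../ consumed by [./]*+, it is the only thing that stays in the match result).

Since <other_content> and the URL subpattern can't match a } (outside quoted or comment parts), you are sure never to match anything outside the @font-face definition. The second consequence is that the pattern always fails after the last URL. Thus, at the next attempt, the "contiguous branch" will fail until the next @font-face.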
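The \G behaviour can be tried on a toy input that has nothing to do with CSS. This sketch uses a made-up string and a "list:" marker in the role that @font-face { plays above. It excludes the start of the string with the lookahead (?!^), chosen for this example in place of the lookbehind used in the answer.

```php
<?php
// Made-up input: only the numbers that directly follow "list:" should match.
$text = 'skip 1,2 list:10,20,30 other 40';

// First branch: the entry point ("list:").
// Second branch: \G continues right where the previous match ended,
// but only if a comma follows; (?!^) rules out the start of the string.
$regex = '~(?:list:|\G(?!^),)\K\d+~';

preg_match_all($regex, $text, $m);
print_r($m[0]);
```

Once the chain is broken (the space after 30), \G can no longer match until the entry point is found again, which is why 40 is not part of the result.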

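Another trick:

The main pattern begins with \g<comment> (*SKIP)(*FAIL) | to skip all content inside comments /*....*/. \g<comment> refers to the basic subpattern that describes what a comment looks like. (*SKIP) forbids retrying the substring that has been matched before it (on its left, by \g<comment>) if the pattern fails on its right. (*FAIL) forces the pattern to fail. With this trick, comments are skipped and are not part of the match results (since the pattern fails).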
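The comment-skipping trick and the (?(DEFINE)...) section can be checked together on a reduced pattern. In this sketch the second branch is a simplified url(...) matcher, which is an assumption for illustration only, and the sample CSS is made up.

```php
<?php
// Made-up stylesheet: one url() inside a comment, one in a real rule.
$css = "/* url(old.png) */\n"
     . ".a { background: url(new.png); }\n";

// (?(DEFINE)...) only declares the named subpattern "comment"; it matches nothing itself.
// \g<comment> consumes a whole comment, then (*SKIP)(*FAIL) discards it.
$regex = '~(?(DEFINE)(?<comment>/\*.*?\*/))\g<comment>(*SKIP)(*FAIL)|url\([^)]*\)~s';

preg_match_all($regex, $css, $m);
print_r($m[0]);
```

The url() inside the comment is never reported, because the engine resumes searching after the end of the comment.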

Subpattern details:

quoted_content: It is used in <other_content> to avoid matching a url( or /* that is inside quotes.

(["'])               # capture group: the opening quote
(?>                  # atomic group: all possible content between quotes
    [^"'\\]++        # all that is not a quote or a backslash
  |                  # OR
    \\{2}            # two backslashes: (two \ doesn't escape anything)
  |                  # OR
    \\.              # any escaped character
  |                  # OR
    (?!\g{-1})["']   # the other quote (this one that is not in the capture group)
)*+                  # repeat zero or more time the atomic group
\g{-1}               # backreference to the last capturing group

other_content: everything that is not the closing curly brace or a URL without http: or data:

(?>                     # open an atomic group
    [^u}/"']++          # all character that are not problematic!
  |
    \g<quoted_content>  # string inside quotes
  |
    \g<comment>         # string inside comments
  |
    \Bu                 # "u" not preceded by a word boundary
  |
    u(?!rl\s*+\()       # "u" not followed by "rl("  (not the start of an url definition)
  |
    /(?!\*)             # "/" not followed by "*" (not the start of a comment)
  |
    \g<url_start>       # match the url that begins with "http:"
    \g<url_skip> ["']?+ # until the possible quote
)++                     # repeat the atomic group one or more times

anchor

\G(?<!^) ["']?+      # contiguous to a previous match, with a possible closing quote
|                    # OR
@font-face \s*+ {    # start of the @font-face definition

Notice:

You can improve the main pattern:

After the last URL of an @font-face, the regex engine attempts to match with the "contiguous branch" of <anchor> and matches all characters until the } that makes the pattern fail. Then, at each of those same characters, the regex engine must try the two branches of <anchor> (which will always fail until the }).

To avoid these useless attempts, you can change the main pattern to:

\g<comment> (*SKIP)(*FAIL) |

\g<anchor> \g<other_content>?+
(?>
    \g<url_start> \K [./]*+  ([^"'\s)}]*+)
  |
    } (*SKIP)(*FAIL)
)

With this new scenario, the first character after the last URL is matched by the "contiguous branch", \g<other_content> matches all characters until the }, \g<url_start> fails immediately, the } is matched, and (*SKIP)(*FAIL) makes the pattern fail and forbids retrying these characters.
