Question

This script finds the [link] tags, with their parameters and values, in a BBCode text:

<?php
preg_match_all(
    '#\[(link)(.*?)!?\](.*?)\[\/\\1\]#i', 
    '[link href="http://www.google.com" title="Google" target="_blank"]Google[/link]
     [link href="http://www.facebook.com"]Facebook[/link]
     [link href=\'http://www.twitter.com\' rel="nofollow"]Twitter[/link]', 
    $StrMatches
);

/* $StrMatches[0] = Full tag string
 * $StrMatches[1] = Tag name
 * $StrMatches[2] = tag params string
 * $StrMatches[3] = Tag content
 * */
print_r($StrMatches);


$ParamList = array();

foreach ($StrMatches[2] as $TagParamStr )
{
   // Inside a character class "|" is a literal pipe, not alternation, so it is left out here.
   preg_match_all('#\s*([^=]+)=[\'"]([^\'"]*)[\'"]#', $TagParamStr, $ParamMatches);
   array_push($ParamList, $ParamMatches);
}

/* For each tag i:
 * $ParamList[i][0] = Full param strings
 * $ParamList[i][1] = Param names
 * $ParamList[i][2] = Param values
 * */
print_r($ParamList);

Output:

 Array
(
[0] => Array
    (
        [0] => [link href="http://www.google.com" title="Google" target="_blank"]Google[/link]
        [1] => [link href="http://www.facebook.com"]Facebook[/link]
        [2] => [link href='http://www.twitter.com' rel="nofollow"]Twitter[/link]
    )

[1] => Array
    (
        [0] => link
        [1] => link
        [2] => link
    )

[2] => Array
    (
        [0] =>  href="http://www.google.com" title="Google" target="_blank"
        [1] =>  href="http://www.facebook.com"
        [2] =>  href='http://www.twitter.com' rel="nofollow"
    )

[3] => Array
    (
        [0] => Google
        [1] => Facebook
        [2] => Twitter
    )

) 
Array
(
[0] => Array
    (
        [0] => Array
            (
                [0] =>  href="http://www.google.com"
                [1] =>  title="Google"
                [2] =>  target="_blank"
            )

        [1] => Array
            (
                [0] => href
                [1] => title
                [2] => target
            )

        [2] => Array
            (
                [0] => http://www.google.com
                [1] => Google
                [2] => _blank
            )

    )

[1] => Array
    (
        [0] => Array
            (
                [0] =>  href="http://www.facebook.com"
            )

        [1] => Array
            (
                [0] => href
            )

        [2] => Array
            (
                [0] => http://www.facebook.com
            )

    )

[2] => Array
    (
        [0] => Array
            (
                [0] =>  href='http://www.twitter.com'
                [1] =>  rel="nofollow"
            )

        [1] => Array
            (
                [0] => href
                [1] => rel
            )

        [2] => Array
            (
                [0] => http://www.twitter.com
                [1] => nofollow
            )

    )

)

The code works fine! But I would like to optimize it by using a single regex.

How can I turn this into a single regex?

Sorry for my bad English :(

Answer 1

Short answer:

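Not really possible in the way you imagine, because a regular expression captures a fixed set of defined groups. The ideal would be a single match that captures param1, param2, and the value. But since the number of attributes varies, that cannot be done. If we try to repeat a capture group more than once, the pattern matches the whole string but the group only keeps its last occurrence.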
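To illustrate the limitation, here is a minimal sketch. The pattern and the sample string are assumptions made up for this illustration (they are not part of the original answer); the repeated group is the only point of interest.

```php
<?php
// Hypothetical pattern: one repeated group that matches every name="value" pair.
$pattern = '#\[link(?:\s+(\w+)=["\']([^"\']*)["\'])+\]#';
$subject = '[link href="http://www.google.com" title="Google" target="_blank"]';

preg_match($pattern, $subject, $m);

// $m[0] holds the whole opening tag, but groups 1 and 2 only keep
// what they captured on the final iteration of the repeated group.
var_dump($m[1]); // string(6) "target"
var_dump($m[2]); // string(6) "_blank"
```

The href and title pairs take part in the match, yet they cannot be read back from the capture groups afterwards.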

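However, as you will see, all of this data can be matched and captured with one expression. Each link is then split across several matches, and each match carries part of the data. In my example, capture group 1 is the attribute name, capture group 2 is the attribute's value, and capture group 3 is the link's content. If one of these items is not present in a match, its capture group is left empty.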

Explanation:

(?# START OF LINK)
(?:         (?# start non-capture group)
  \[link    (?# match [link literally)
 |          (?# OR)
  (?!^)     (?# assertion to make sure we aren't at the beginning of the string)
  \G        (?# start at the end of last match)
)           (?# end non-capture group)
\K          (?# throw everything to the left away)

(?# START OF CAPTURING)
(?:         (?# start non-capture group)
  \s+       (?# match 1+ whitespace characters)
  ([^=\s]+) (?# capture attribute)
  =         (?# match = literally)
  ["']      (?# match ' or ")
  (.*?)     (?# lazily capture attribute's value)
  ["']      (?# match ' or ")
 |          (?# OR)
  \s*       (?# optionally match whitespace characters)
  \]        (?# match ] literally)
  (.*?)     (?# lazily capture link's value)
  \[/link\] (?# match [/link] literally)
)           (?# end non-capture group)
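The annotated pattern above can also be kept in this readable form inside PHP by using the x (extended) modifier, which ignores whitespace in the pattern and allows # comments. This is only a sketch: the ~ delimiter, the variable names and the sample string are choices made for this illustration.

```php
<?php
// The x modifier ignores pattern whitespace and treats "#" as a comment start,
// so "~" is used as the delimiter here instead of "#".
$regex = '~
    (?: \[link | (?!^) \G )                 # opening tag, or resume after the last match
    \K                                      # drop everything matched so far
    (?:
        \s+ ([^=\s]+) = ["\'] (.*?) ["\']   # group 1: attribute, group 2: its value
      |
        \s* \] (.*?) \[/link\]              # group 3: content of the link
    )
~six';

$text = '[link href=\'http://www.twitter.com\' rel="nofollow"]Twitter[/link]';
preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

echo count($matches), "\n"; // 3: two attribute matches, then the content match
echo $matches[0][1], '=', $matches[0][2], "\n"; // href=http://www.twitter.com
echo $matches[2][3], "\n"; // Twitter
```

Note that the comments inside the pattern must not contain the delimiter or a single quote, otherwise the pattern string ends early.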


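The keys are \G and \K. The first time the regex engine makes a match, it starts at [link, and everything matched up to that point is thrown away by \K. We then carry on from where we are and capture an attribute and its value. Then the match ends. Now the engine goes around again. It cannot find [link, so it uses \G to start at the end of the last attribute, and \K again throws everything away. It can either find another attribute, or hit the alternation and match the end of the link with the third capture group. After that, when the regex starts over, it finds the next [link and does the whole thing again.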
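The sequence of matches can be made visible by printing the full text of each one. This is a small sketch with a made-up sample string; the $text and $steps names are assumptions for the illustration.

```php
<?php
$regex = '#(?:\[link|(?!^)\G)\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';
$text  = '[link href="http://www.facebook.com" rel="nofollow"]Facebook[/link]';

preg_match_all($regex, $text, $matches, PREG_SET_ORDER);

$steps = array();
foreach ($matches as $match) {
    // $match[0] never contains "[link": \K removed it from the reported match.
    $steps[] = $match[0];
}

print_r($steps);
// Array
// (
//     [0] =>  href="http://www.facebook.com"
//     [1] =>  rel="nofollow"
//     [2] => ]Facebook[/link]
// )
```

The first match is anchored by [link, the second and third by \G, each picking up exactly where the previous one stopped.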

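Update: You will see (?!^) in front of \G; it fixes a problem. \G matches not only the end of the previous match but also the beginning of the string. We want to be sure we are inside a link (that is, after [link) before we start matching content, which means we do not want \G to match at the beginning of the string. The negative lookahead asserts exactly that.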
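A short sketch of the difference, assuming an input that happens to begin with something that looks like an attribute outside of any link. The sample string and the variable names are invented for this illustration.

```php
<?php
// The text starts with an attribute-like fragment that is NOT inside a [link] tag.
$text = ' stray="oops" [link href="http://www.google.com"]Google[/link]';

$tail    = '\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';
$without = '#(?:\[link|\G)' . $tail;       // \G may match at offset 0
$with    = '#(?:\[link|(?!^)\G)' . $tail;  // \G is rejected at offset 0

preg_match_all($without, $text, $a, PREG_SET_ORDER);
preg_match_all($with, $text, $b, PREG_SET_ORDER);

echo count($a), ' ', $a[0][1], "\n"; // 3 stray
echo count($b), ' ', $b[0][1], "\n"; // 2 href
```

Without the lookahead, the very first attempt succeeds through \G at the start of the string and reports the stray attribute as if it belonged to a link.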

PHP:

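// Sample input: the BBCode text from the question.
$html = '[link href="http://www.google.com" title="Google" target="_blank"]Google[/link]
[link href="http://www.facebook.com"]Facebook[/link]
[link href=\'http://www.twitter.com\' rel="nofollow"]Twitter[/link]';

$regex = '#(?:\[link|(?!^)\G)\K(?:\s+(\w+)=["\'](.*?)["\']|\s*\](.*?)\[/link\])#si';
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

$links = array();
$reset = true; // true when the next match belongs to a new link

foreach($matches as $match) {
    if($reset) {
        // Start a fresh entry for the link we are about to fill in.
        $links[] = array(
            'params' => array(),
            'value' => null
        );

        $reset = false;
    }

    // Find the key of the entry that was added last.
    end($links);
    $key = key($links);

    if(isset($match[3])) {
        // Group 3 only exists on the match that closes the link.
        $links[$key]['value'] = $match[3];
        $reset = true;
    } else {
        // Otherwise this match is one attribute: name in group 1, value in group 2.
        $links[$key]['params'][$match[1]] = $match[2];
    }
}

var_dump($links);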

Output:

array(3) {
  [0]=>
  array(2) {
    ["params"]=>
    array(3) {
      ["href"]=>
      string(21) "http://www.google.com"
      ["title"]=>
      string(6) "Google"
      ["target"]=>
      string(6) "_blank"
    }
    ["value"]=>
    string(6) "Google"
  }
  [1]=>
  array(2) {
    ["params"]=>
    array(1) {
      ["href"]=>
      string(23) "http://www.facebook.com"
    }
    ["value"]=>
    string(8) "Facebook"
  }
  [2]=>
  array(2) {
    ["params"]=>
    array(2) {
      ["href"]=>
      string(22) "http://www.twitter.com"
      ["rel"]=>
      string(8) "nofollow"
    }
    ["value"]=>
    string(7) "Twitter"
  }
}
