Why do I get different return values from the preg_replace() function?

Asked: 2015-07-22 01:31:45

Tags: php regex

I am having a problem with PHP's preg_replace() function.

$string="[y-z]y-z_y[y_z]yav[v_v]";  // I want it to become: [y-z]yellow-zend_yellow[y_z]yav[v_v]

$find = array('/y(?=(?:.(?!\]))*\[)/Um', '/a(?=(?:.(?!\]))*\[)/Um', '/z(?=(?:.(?!\]))*\[)/Um', '/v(?=(?:.(?!\]))*\[)/Um');

$replace = array('yellow', 'avocado', 'zend', 'vodka');

echo preg_replace($find, $replace, $string)."<br><br>"; // displays [y-zend]yellow-zend_yellow[y_zend]yellowavodkaocadovodka[v_v]

echo preg_replace('/y(?=(?:.(?!\]))*\[)/Um', 'yellow', $string)."<br><br>"; // displays [y-z]yellow-z_yellow[y_z]yellowav[v_v]

echo preg_replace('/z(?=(?:.(?!\]))*\[)/Um', 'zend', $string)."<br><br>"; // displays [y-zend]y-zend_y[y_zend]yav[v_v] -- why is zend showing up inside []?

Also, I would like to know whether there is a way, in plain PHP, to add one more condition: if there is a string such as "yav" between "]" and "[", I want it to be ignored.

**[y-z]y-z_y[y_z]yav[v_v] ==> [y-z]yellow-zend_yellow[y_z]yav[v_v]**

OR

$var=[y-z]y-z[y_z]yav[v_v]; ==> $var=[y-z]yellow-zend[y_z]yav[v_v];

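1 Answer:

Answer 0 (score: 2)

The last z] matches because you are using a positive lookahead that contains a negative lookahead, and the two work against each other. The (?!\]) sits after the dot, so it tests the character that follows each consumed character and never tests the character directly after z. The ] right behind z is therefore consumed unchecked, and the scan carries on until it finds a [.

You are telling it to match z if the lookahead matches, while not matching the thing you don't want; it ends up matching the thing you don't want and reporting a successful match anyway. That is how it makes sense in my head, at least.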

https://regex101.com/r/nX5dQ6/1

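Can you express your rules so that they match a multi-character sequence? It would certainly be easier to replace y-z_y with yellow-zend_yellow in one go, but without more context it is impossible to say whether that is an option.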
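A related effect is visible in the question's first call: when preg_replace() is given arrays, each pattern runs on the result of the previous replacement, which is how "yellowav" turned into "yellowavodkaocadovodka". One way around both problems is a single pass with preg_replace_callback(). The following is only a sketch under assumptions that are not in the question: the replacements are limited to the four letters shown, and a letter counts as standalone when it has no lowercase letter on either side. The $map variable is a name introduced here for illustration.

```php
<?php
$string = '[y-z]y-z_y[y_z]yav[v_v]';

// Illustrative lookup table (assumption: only these four letters are replaced).
$map = ['y' => 'yellow', 'a' => 'avocado', 'z' => 'zend', 'v' => 'vodka'];

// Alternative 1: a whole [...] group, which is returned unchanged.
// Alternative 2: one mapped letter with no lowercase letter directly
//                before or after it, so "yav" is left alone.
$pattern = '/\[[^\]]*\]|(?<![a-z])([yazv])(?![a-z])/';

$result = preg_replace_callback(
    $pattern,
    function (array $m) use ($map) {
        // Group 1 is only present when the second alternative matched.
        return isset($m[1]) ? $map[$m[1]] : $m[0];
    },
    $string
);

echo $result, PHP_EOL; // [y-z]yellow-zend_yellow[y_z]yav[v_v]
```

Because the string is scanned once, text inserted by one replacement is never scanned again by another.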

/z(?=(?:.(?!\]))*\[)/Um
    z matches the character z literally (case sensitive)
    (?=(?:.(?!\]))*\[) Positive Lookahead - Assert that the regex below can be matched
        (?:.(?!\]))* Non-capturing group
            Quantifier: * Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
            . matches any character (except newline)
            (?!\]) Negative Lookahead - Assert that it is impossible to match the regex below
                \] matches the character ] literally
        \[ matches the character [ literally
    U modifier: Ungreedy. The match becomes lazy by default. Now a ? following a quantifier makes it greedy
    m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
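
The breakdown above suggests a small change: put the negative lookahead in front of the dot, so that each character is checked before it is consumed. This is a minimal sketch of that idea, assuming the goal is to match z only when a [ appears before the next ]. The variable names $original and $fixed are illustrative.

```php
<?php
$string = '[y-z]y-z_y[y_z]yav[v_v]';

// Original pattern: the dot consumes a character first, then (?!\]) inspects
// the character after it, so the "]" right behind "z" is consumed unchecked.
$original = preg_replace('/z(?=(?:.(?!\]))*\[)/Um', 'zend', $string);

// Reordered pattern: (?!\]) runs before each dot, so the scan stops at the
// first "]" and the lookahead only succeeds if a "[" is reached before it.
$fixed = preg_replace('/z(?=(?:(?!\]).)*\[)/', 'zend', $string);

echo $original, PHP_EOL; // [y-zend]y-zend_y[y_zend]yav[v_v]
echo $fixed, PHP_EOL;    // [y-z]y-zend_y[y_z]yav[v_v]
```

Note that this variant still requires a [ somewhere after the letter, so a letter that follows the last bracket group would not be replaced.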

Personally, I would probably write a tokenizer for this. The idea is to use preg_match_all, like so:

 $matches = null;
 $returnValue = preg_match_all('/(?P<T_OPEN>\[)|(?P<T_CLOSE>\])|(?P<T_Y>y)|(?P<T_X>x)|(?P<T_Z>z)|(?P<T_SEPH>\-)|(?P<T_SEPU>\_)/', '[y-z]y-z_y[y_z]yav[v_v]', $matches, PREG_PATTERN_ORDER);

which returns

 array (
    0 => 
         array (
               0 => '[',
               1 => 'y',
               2 => '-',
               3 => 'z',
               4 => ']',
             ...
        ),
   'T_OPEN' => 
        array (
            0 => '[',
            1 => '',
            2 => '',
            3 => '',
            4 => '',
   ..

With a little post-processing, this can be reduced to a list of tokens

 array('T_OPEN', 'T_Y', 'T_SEPH', 'T_Z', 'T_CLOSE', ...);

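These are the names of the capture groups. From there it is fairly simple to write some logic that walks the list and works out whether you are inside a [] group, or whether a T_Y, T_X or T_Z token is preceded by another T_Y, T_X or T_Z token. This is the most robust approach.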

To reduce the result to just tokens, loop over the $matches[0] entries with a for loop and check which of the named groups has a value at the same index (untested, but this is the basis of it)

 $total = count($matches[0]); // number of matched tokens
 // keep only the string keys (the named groups); these are our token names
 $tokens = array_filter(array_keys($matches), 'is_string');

 $tokenstream = array();
 for ($i = 0; $i < $total; $i++) {
     $found = 'T_UNKNOWN'; // default token for validation, used when no named group matched
     foreach ($tokens as $token) {
         // a non-empty value in $matches[$token][$i] means this group produced match $i
         if (strlen($matches[$token][$i]) > 0) {
             $found = $token;
             break; // stop at the first group that matched
         }
     }
     $tokenstream[] = $found;
 }

Then build the string from scratch using the tokens

 $out = '';
 $literal = false; // true while we are inside [ ... ]

 foreach ($tokenstream as $offset => $token) {
     switch ($token) {
         case 'T_OPEN':
             $out .= '[';
             $literal = true;  // start brackets
             break;
         case 'T_CLOSE':
             $out .= ']';
             $literal = false; // end brackets
             break;
         case 'T_SEPH':
             $out .= '-';
             break;
         case 'T_SEPU':
             $out .= '_';
             break;
         case 'T_Y':
             // inside brackets keep the literal y, otherwise use the word yellow
             $out .= $literal ? 'y' : 'yellow';
             break;
         case 'T_UNKNOWN':
             // validate
             throw new Exception("Error unknown token at offset: $offset");
     }
 }

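You would still need to work out T_Z, then T_A and so on, but this is a sure way of doing it and it avoids all of the mess above. Bear in mind that this is only a very rough way of thinking about a problem like this.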
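
For completeness, here is one way the tokenizer idea could be finished end to end. It is a sketch rather than the answer's own code, and it makes several assumptions: the function names tokenize() and rebuild() are hypothetical, the token set is simplified to a generic T_LETTER and T_SEP, a catch-all T_UNKNOWN group is added to the pattern, and PREG_SET_ORDER is used instead of PREG_PATTERN_ORDER so that each match carries its own group values.

```php
<?php
// Hypothetical helper: split the input into [token name, text] pairs.
function tokenize(string $input): array
{
    $pattern = '/(?P<T_OPEN>\[)|(?P<T_CLOSE>\])|(?P<T_LETTER>[a-z])'
             . '|(?P<T_SEP>[-_])|(?P<T_UNKNOWN>.)/s';
    preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

    $stream = [];
    foreach ($matches as $m) {
        foreach ($m as $key => $value) {
            // The first named group with a non-empty value is the token.
            if (is_string($key) && $value !== '') {
                $stream[] = [$key, $value];
                break;
            }
        }
    }
    return $stream;
}

// Hypothetical helper: rebuild the string, expanding standalone letters
// that sit outside brackets.
function rebuild(array $stream, array $map): string
{
    $out = '';
    $literal = false; // true while inside [ ... ]
    foreach ($stream as $i => [$type, $text]) {
        switch ($type) {
            case 'T_OPEN':
                $literal = true;
                $out .= $text;
                break;
            case 'T_CLOSE':
                $literal = false;
                $out .= $text;
                break;
            case 'T_LETTER':
                $prev = $stream[$i - 1][0] ?? null;
                $next = $stream[$i + 1][0] ?? null;
                // A letter next to another letter (as in "yav") is kept as is.
                $standalone = $prev !== 'T_LETTER' && $next !== 'T_LETTER';
                $expand = !$literal && $standalone && isset($map[$text]);
                $out .= $expand ? $map[$text] : $text;
                break;
            case 'T_UNKNOWN':
                throw new UnexpectedValueException("Unknown token at offset: $i");
            default: // T_SEP
                $out .= $text;
        }
    }
    return $out;
}

$map = ['y' => 'yellow', 'a' => 'avocado', 'z' => 'zend', 'v' => 'vodka'];
$rebuilt = rebuild(tokenize('[y-z]y-z_y[y_z]yav[v_v]'), $map);
echo $rebuilt, PHP_EOL; // [y-z]yellow-zend_yellow[y_z]yav[v_v]
```

Keeping the matched text next to each token name means new letters only need an entry in $map, not a new case in the switch.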