Regular expression to capture groups between repeated words

Asked: 2019-05-13 21:24:40

Tags: php regex preg-match pcre

The keywords are "*OR" and "*AND".

Suppose I have the following string:

  

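This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.

I want the following: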

group1 "This is a t3xt with special characters like !#."  
group2 "*AND"  
group3 "and this is another text with special characters"  
group4 "*AND"  
group5 "this repeats"  
group6 "*OR"  
group7 "do not repeat"  
group8 "*OR"  
group9 "have more strings"  
group10 "*AND"  
group11 "finish with this string."  

I tried this:

(.+?)(\*AND|\*OR)

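But it only captures the first string, so I would have to keep repeating the pattern to collect the others. The problem is that some strings contain only one *AND, or only one *OR, or dozens of them; the count is fairly random. The following regex does not work either: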

((.+?)(\*AND|\*OR))+

For example, it gives:

  

This is a t3xt with special characters like !#. *AND and this is another text with special characters

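1 Answer:

Answer 0 (score: 2)

PHP has a preg_split function for this kind of thing. preg_split splits a string on a delimiter, and the delimiter can be defined as a regex pattern. It also accepts a flag, PREG_SPLIT_DELIM_CAPTURE, that includes the matched delimiters in the result, provided the delimiter is wrapped in a capturing group.

So the regex only has to describe the delimiter itself, instead of matching the whole text.

Example: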

$string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";
$string = preg_split('~(\*(?:AND|OR))~',$string,0,PREG_SPLIT_DELIM_CAPTURE);
print_r($string);

Output:

Array
(
    [0] => This is a t3xt with special characters like !#. 
    [1] => *AND
    [2] =>  and this is another text with special characters 
    [3] => *AND
    [4] =>  this repeats 
    [5] => *OR
    [6] =>  do not repeat 
    [7] => *OR
    [8] =>  have more strings 
    [9] => *AND
    [10] =>  finish with this string.
)
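
The pieces above still carry the spaces that surrounded each delimiter, whereas the groups listed in the question are trimmed. One possible way to get trimmed pieces is to let the delimiter pattern consume the surrounding whitespace. This is a minimal sketch rather than part of the original answer: the added `\s*`, the PREG_SPLIT_NO_EMPTY flag, and the `$parts` variable name are assumptions made for illustration.

```php
<?php
$string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";

// The \s* parts sit outside the capturing group, so the whitespace around
// each keyword is consumed by the delimiter but not returned with it.
// PREG_SPLIT_NO_EMPTY drops empty pieces, e.g. when the string starts
// or ends with a keyword.
$parts = preg_split(
    '~\s*(\*(?:AND|OR))\s*~',
    $string,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($parts);
```

With the sample string this should yield the same eleven elements as above, minus the leading and trailing spaces. If the whitespace inside a piece matters (for example a piece that is only spaces), this variant would not be appropriate.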

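However, if you really want to stick with the preg_match family (which is what you tagged in the question), you need preg_match_all. It works like preg_match, except that it performs a global (repeated) match.

Example: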

$string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";
preg_match_all('~(?:(?:(?!\*(?:AND|OR)).)+)|(?:\*(?:AND|OR))~',$string,$matches);
print_r($matches);

Output:

Array
(
    [0] => Array
        (
            [0] => This is a t3xt with special characters like !#. 
            [1] => *AND
            [2] =>  and this is another text with special characters 
            [3] => *AND
            [4] =>  this repeats 
            [5] => *OR
            [6] =>  do not repeat 
            [7] => *OR
            [8] =>  have more strings 
            [9] => *AND
            [10] =>  finish with this string.
        )

)

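First, note that unlike preg_split, which returns a flat array, preg_match_all (and preg_match) fills $matches with a multidimensional array. Second, the pattern I used could technically be simplified a bit, but at the cost of having to reference several arrays inside that multidimensional result (one for the matched text and another for the matched delimiters). You would then have to loop over them and read from each alternately; IOW, extra cleanup would be needed to end up with a single array containing both sets of matches, like the one above.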
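
To make that trade-off concrete, here is one way a simplified pattern and its cleanup loop might look. It is a sketch under assumptions, not necessarily the simplification the answer had in mind: the two-group pattern and the `$flat` variable are illustrative choices.

```php
<?php
$string = "This is a t3xt with special characters like !#. *AND and this is another text with special characters *AND this repeats *OR do not repeat *OR have more strings *AND finish with this string.";

// Group 1: the text up to the next keyword (lazy match).
// Group 2: the keyword itself, or an empty match at the end of the string.
preg_match_all('~(.+?)(\*(?:AND|OR)|$)~', $string, $matches);

// $matches[1] holds the text pieces and $matches[2] the delimiters,
// so they have to be interleaved by hand into one flat array.
$flat = [];
foreach ($matches[1] as $i => $text) {
    $flat[] = $text;
    if ($matches[2][$i] !== '') {
        $flat[] = $matches[2][$i]; // no delimiter follows the last piece
    }
}

print_r($flat);
```

The loop is the extra cleanup mentioned above: the pattern is shorter, but the result has to be rebuilt from two parallel arrays.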

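I only show this approach because it is technically what you asked for in the question. I recommend preg_split instead: it avoids most of that overhead, and handling this kind of situation is exactly why it exists.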