Regex to extract sentences in which a word occurs exactly twice?

Time: 2013-05-12 09:56:06

Tags: regex

aardvark is an animal with aardvark
aardvark is an animal with aardvark along with another aardvark
aardvark is an animal with an elephant that loves an aardvark that lives in downtown

aardvark is an animal with aardvark. aardvark is an animal with aardvark along with another aardvark. aardvark is an animal with an elephant that loves an aardvark that lives in downtown

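This is the text from which I have to extract only the sentences in which aardvark occurs exactly twice.

I tried the expression ((.*?)(aardvark)(.*?)(aardvark)(.*?)[\.\n])(.*\baardvark\b.*){2}, but it returns every sentence as a match.

How should I approach this?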

4 answers:

Answer 0: (score: 2)

Try this:

^(((?!\baardvark\b)\b\w+\b\s+)*?\baardvark\b\s*((?!\baardvark\b)\b\w+\b\s+)*?){2}$

Answer 1: (score: 1)

If you are only looking for sentences containing a simple (static) word, you do not need a regular expression at all.

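$words = explode(' ', $sentence); // or preg_split, if you want to split on spaces, tabs, hyphens, etc.
$counts = array_count_values($words);
// isset() guards against a sentence that does not contain the word at all
if (isset($counts['aardvark']) && $counts['aardvark'] == 2) {
  // found!
} else {
  // not interested
}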
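
The snippet above handles a single sentence. A minimal sketch of how it could be applied to the whole text from the question follows. It assumes that a sentence ends at a period or a line break, and the function name sentencesWithWordCount is hypothetical, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the sentences in which $word occurs exactly $times times.
function sentencesWithWordCount($text, $word, $times = 2)
{
    // Assumption: a sentence ends at a period or a line break.
    $sentences = preg_split('/[.\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $found = array();
    foreach ($sentences as $sentence) {
        $sentence = trim($sentence);
        // Split on runs of non-word characters so punctuation does not stick to words.
        $words = preg_split('/\W+/', strtolower($sentence), -1, PREG_SPLIT_NO_EMPTY);
        $counts = array_count_values($words);
        if (isset($counts[$word]) && $counts[$word] === $times) {
            $found[] = $sentence;
        }
    }
    return $found;
}

$text = "aardvark is an animal with aardvark. "
      . "aardvark is an animal with aardvark along with another aardvark. "
      . "aardvark is an animal with an elephant that loves an aardvark that lives in downtown";

print_r(sentencesWithWordCount($text, 'aardvark'));
```

With the sample text this should keep the first and third sentences and drop the second, which contains the word three times. The comparison is exact on lowercased words, so a plural such as "aardvarks" is not counted.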

Answer 2: (score: 1)

If you really want to use a regular expression:

<?php

$data = 'aardvark aardvark aardvark aardvark
aardvark is an animal with aardvark
aardvark is an animal with aardvark along with another aardvark
aardvark is an animal with an elephant that loves an aardvark that lives in downtown';

preg_match_all("@(^|[\.\n])((?:(?!aardvark).)*(aardvark)(?:(?!aardvark).)*(aardvark)(?:(?!aardvark).)*)([\.\n]|$)@sU", ($data), $matches, PREG_SET_ORDER);

foreach($matches as $match)
    echo $match[2] . '<br />';
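
The core of the pattern above is the construct (?:(?!aardvark).)*, which consumes any run of characters in which the word does not begin. A simplified sketch of the same idea is shown here, assuming the text is first split into sentences on periods and line breaks so that the pattern can be anchored with ^ and $. The helper name hasWordExactlyTwice is hypothetical.

```php
<?php
// Hypothetical helper: true when $word occurs exactly twice in $sentence.
function hasWordExactlyTwice($sentence, $word)
{
    $w = preg_quote($word, '~');
    // Any run of characters in which $word does not begin.
    $gap = '(?:(?!' . $w . ').)*';
    // gap, word, gap, word, gap: anchored, so a third occurrence cannot fit anywhere.
    $pattern = '~^' . $gap . $w . $gap . $w . $gap . '$~s';
    return preg_match($pattern, $sentence) === 1;
}

$data = "aardvark aardvark aardvark aardvark\n"
      . "aardvark is an animal with aardvark\n"
      . "aardvark is an animal with aardvark along with another aardvark\n"
      . "aardvark is an animal with an elephant that loves an aardvark that lives in downtown";

// Assumption: sentences are separated by periods or line breaks.
foreach (preg_split('/[.\n]+/', $data, -1, PREG_SPLIT_NO_EMPTY) as $sentence) {
    $sentence = trim($sentence);
    if (hasWordExactlyTwice($sentence, 'aardvark')) {
        echo $sentence, PHP_EOL;
    }
}
```

Like the pattern in the answer, this sketch uses no word boundaries, so "aardvarks" counts as an occurrence of "aardvark".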

Answer 3: (score: 1)

You can try this:

<pre>
<?php
$subject = <<<LOD
aardvark is an animal with aardvark
aardvark is an animal with aardvark along with another aardvark
aardvark is an animal with an elephant that loves an aardvark that lives in downtown
LOD;

$pattern = <<<'LOD'
~
(?(DEFINE) # the word
    (?<tw> \b aardvark \b ) )

(?(DEFINE) # other word
    (?<ow> \b (?!\g<tw>)[a-z]++ \b ) )

(?(DEFINE) # not a word 
    (?<nw>[^a-z]++) )

(?(DEFINE) # not the word
    (?<ntw> (?> \g<ow> | \g<nw> )++ ) )

# pattern :    
    ^ \g<ntw>? \g<tw> \g<ntw> \g<tw> \g<ntw>? $ 
~xim
LOD;
/* a more condensed version */
$pattern = <<<'LOD'
~
    ^ (?<ntw> (?> \b(?!\g<tw>)[a-z]++\b | [^a-z]++ )++ )?
      (?<tw> \b aardvark \b )
      \g<ntw> \g<tw> \g<ntw>? $
~xim
LOD;

preg_match_all($pattern, $subject, $matches);

print_r($matches[0]);

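Note that you can replace the "ow" group with (?<ow> \b (?> [b-z] | (?!\g<tw>)a ) [a-z]*+ \b ) ) for better performance, but keep in mind that for a word that does not begin with the letter a you must change both that letter and the first character class. Example for "koala":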

(?<ow> \b (?> [a-jl-z] | (?!\g<tw>)k ) [a-z]*+ \b ) )
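
Adjusting the letter and the character class by hand is easy to get wrong, so the two could be derived from the target word instead. The following is only a sketch under stated assumptions: the target is a single word made of lowercase letters a to z, and the function names buildOtherWordGroup and buildExactlyTwicePattern are hypothetical. The generated pattern keeps the same group names (tw, ow, nw, ntw) as the pattern above.

```php
<?php
// Hypothetical helper: builds the optimized "ow" (other word) group
// for a target word made of lowercase letters a-z.
function buildOtherWordGroup($word)
{
    $first = $word[0];
    // All letters a-z except the target's first letter, written as one or two ranges.
    $before = $first > 'a' ? 'a-' . chr(ord($first) - 1) : '';
    $after  = $first < 'z' ? chr(ord($first) + 1) . '-z' : '';
    return '(?<ow> \b (?> [' . $before . $after . '] | (?!\g<tw>)' . $first . ' ) [a-z]*+ \b )';
}

// Hypothetical helper: assembles the full "exactly twice" pattern around that group.
function buildExactlyTwicePattern($word)
{
    $tw  = '(?<tw> \b ' . preg_quote($word, '~') . ' \b )';
    $nw  = '(?<nw> [^a-z]++ )';
    $ntw = '(?<ntw> (?> \g<ow> | \g<nw> )++ )';
    return '~(?(DEFINE) ' . $tw . ' ' . buildOtherWordGroup($word) . ' ' . $nw . ' ' . $ntw . ' )'
         . ' ^ \g<ntw>? \g<tw> \g<ntw> \g<tw> \g<ntw>? $ ~xim';
}

echo buildOtherWordGroup('koala'), PHP_EOL;

$pattern = buildExactlyTwicePattern('koala');
var_dump(preg_match($pattern, 'koala eats with koala'));      // two occurrences
var_dump(preg_match($pattern, 'koala and koala and koala'));  // three occurrences
```

For "koala" the first helper produces the same character class as the example above, [a-jl-z], and for "aardvark" it produces [b-z].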