My regular expression fails to capture consecutive capitalized words. Here is what I want the regex to capture:
"said Polly Pocket and the toys" -> Polly Pocket
This is the regex I am using:
re.findall('said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)
It returns the following:
[('Polly Pocket', ' Pocket')]
I want it to return:
['Polly Pocket']
Answer 0 (score: 24)
Use a positive lookahead:
([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)
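The lookahead asserts that, for the current word to be accepted, it must be followed by another word that starts with an uppercase letter. Breakdown: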
(                 # begin capture
  [A-Z]           # one uppercase letter    \ First Word
  [a-z]+          # 1+ lowercase letters    /
  (?=\s[A-Z])     # must be followed by whitespace and an uppercase letter
  (?:             # non-capturing group
    \s            # whitespace
    [A-Z]         # one uppercase letter    \ Additional Word(s)
    [a-z]+        # 1+ lowercase letters    /
  )+              # group can be repeated (more words)
)                 # end capture
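As a rough illustration, the pattern above can be tried in Python like this. This is only a sketch: the variable names `PATTERN`, `article`, and `matches` are made up for the example, and `re.VERBOSE` is used so the layout of the breakdown can be kept inside the pattern.

```python
import re

# The lookahead pattern from this answer, written in verbose mode.
# In verbose mode, whitespace in the pattern is ignored and "#" starts a comment.
PATTERN = re.compile(r"""
    (                       # begin capture
      [A-Z][a-z]+           # first word
      (?=\s[A-Z])           # must be followed by whitespace + uppercase letter
      (?:\s[A-Z][a-z]+)+    # one or more additional capitalized words
    )                       # end capture
""", re.VERBOSE)

# Sample sentence taken from the question.
article = "said Polly Pocket and the toys"

# Only one capturing group, so findall returns a list of plain strings.
matches = PATTERN.findall(article)
print(matches)  # ['Polly Pocket']
```

Note that `[a-z]+` is stricter than the `[\w-]*` used in the question: it will not match words containing digits, hyphens, or inner capitals, and it requires at least one lowercase letter after the initial capital.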
Answer 1 (score: 6)
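This happens because findall returns every capture group in the regex, and you have two of them: one that captures the whole matched phrase and one that captures the following word. When a pattern contains more than one capture group, findall returns a tuple of the groups for each match, and a repeated group keeps only its last repetition, which is where ' Pocket' comes from.
You can turn the second capture group into a non-capturing group by using (?:regex) instead of (regex):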
re.findall(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)
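To make the difference concrete, here is a small side-by-side sketch using the sentence from the question. The variable names are illustrative, and the `said ` prefix from the original pattern is kept here even though the line above drops it.

```python
import re

# Sample sentence taken from the question.
article = "said Polly Pocket and the toys"

# Two capturing groups: findall returns one tuple per match,
# with one item per group. The inner (repeated) group only keeps
# the text of its last repetition.
with_groups = re.findall(r'said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

# Inner group made non-capturing with (?:...): a single capturing
# group remains, so findall returns a list of plain strings.
without_inner = re.findall(r'said ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)

print(with_groups)    # [('Polly Pocket', ' Pocket')]
print(without_inner)  # ['Polly Pocket']
```

The raw-string prefix `r'...'` is not required for the match itself; it just avoids having Python interpret backslash sequences such as `\w` and `\s` before the regex engine sees them.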
Answer 2 (score: 4)
$mystring = "the United States of America has many big cities like New York and Los Angeles, and others like Atlanta";
@phrases = $mystring =~ /[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*/g;
print "\n" . join(", ", @phrases) . "\n\n# phrases = " . scalar(@phrases) . "\n\n";
Output:
$ ./try_me.pl
United States, America, New York, Los Angeles, Atlanta
# phrases = 5
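For readers working in Python, as in the question, a roughly equivalent sketch of the Perl snippet might look like the following. It is a port for illustration, not part of the original answer; the names `text` and `phrases` are made up.

```python
import re

# Same sample sentence as in the Perl snippet.
text = ("the United States of America has many big cities like "
        "New York and Los Angeles, and others like Atlanta")

# The pattern has no capturing groups, so findall returns the full
# text of each match. The trailing "*" on the group means a single
# capitalized word (e.g. "America") also counts as a phrase.
phrases = re.findall(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*", text)

print(", ".join(phrases))
print("# phrases =", len(phrases))
```

Unlike the patterns in the other answers, this one also returns standalone capitalized words, because the group of additional words is repeated with `*` (zero or more) rather than `+` (one or more).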