Question

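I am trying to extract a list of complete sentences from a body of plain text using regular expressions in Python 2.7. For my purposes it does not matter whether everything that could be read as a complete sentence ends up in the list, but everything in the list does need to be a complete sentence. Here is code that illustrates the problem: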

import re
text = "Hello World! This is your captain speaking."
sentences = re.findall("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)
print sentences

According to this regex tester, I should in theory get a list like this:

>>> ["Hello World!", "This is your captain speaking."]

But the output I actually get is this:

>>> [' World', ' speaking']

The documentation says that findall searches from left to right and that the * and + operators are greedy. Any help is appreciated.

Answer 1

The problem is that findall() returns only the captured subgroups, not the full match. According to the documentation for re.findall():

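If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.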
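The rule quoted above can be checked with a small sketch. The three patterns below are illustrative only and are not taken from the question; they differ solely in how many capturing groups they contain.

```python
import re

text = "Hello World! This is your captain speaking."

# No groups: findall returns the full matches.
no_groups = re.findall(r"[A-Z]\w+", text)

# One group: findall returns only what that group captured.
one_group = re.findall(r"([A-Z])\w+", text)

# Two groups: findall returns one tuple per match.
two_groups = re.findall(r"([A-Z])(\w+)", text)

print(no_groups)   # ['Hello', 'World', 'This']
print(one_group)   # ['H', 'W', 'T']
print(two_groups)  # [('H', 'ello'), ('W', 'orld'), ('T', 'his')]
```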

It is easy to see what is happening by switching from re.findall() to re.finditer() and exploring the match objects:

>>> import re
>>> text = "Hello World! This is your captain speaking."

>>> it = re.finditer("[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]", text)

>>> mo = next(it)
>>> mo.group(0)
'Hello World!'
>>> mo.groups()
(' World',)

>>> mo = next(it)
>>> mo.group(0)
'This is your captain speaking.'
>>> mo.groups()
(' speaking',)

The solution is to suppress the subgroup with ?:, which makes it non-capturing. Then you get the expected result:

>>> re.findall("[A-Z]\w+(?:\s+\w+[,;:-]?)*[.!?]", text)
['Hello World!', 'This is your captain speaking.']
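If the capturing group has to stay in the pattern for some other reason, one possible alternative is to keep using re.finditer() and read group(0) from each match. This is a minimal sketch of that idea, not part of the original answer; the names pattern, sentences and last_words are chosen here for illustration.

```python
import re

text = "Hello World! This is your captain speaking."

# Same pattern as in the question, with the capturing group left in place.
pattern = re.compile(r"[A-Z]\w+(\s+\w+[,;:-]?)*[.!?]")

# group(0) is always the whole match, whatever groups the pattern contains.
sentences = [m.group(0) for m in pattern.finditer(text)]

# group(1) keeps only the last repetition of the repeated group,
# which is exactly what findall() returned in the question.
last_words = [m.group(1) for m in pattern.finditer(text)]

print(sentences)   # ['Hello World!', 'This is your captain speaking.']
print(last_words)  # [' World', ' speaking']
```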

Answer 2

You can change your regex slightly:

>>> re.findall(r"[A-Z][\w\s]+[!.,;:]", text)
['Hello World!', 'This is your captain speaking.']
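One thing worth checking before adopting this pattern: its final character class treats commas, semicolons and colons as terminators and does not include the question mark. The sketch below runs it on a different sample string (made up here, not from the question) to show the effect.

```python
import re

# Hypothetical sample text with a comma inside a sentence and a question.
sample = "Well, this works. Does it?"

# The pattern from this answer, unchanged.
matches = re.findall(r"[A-Z][\w\s]+[!.,;:]", sample)

# The match stops at the first comma, and "Does it?" is skipped
# because "?" is not in the terminator class.
print(matches)  # ['Well,']
```

Whether that matters depends on the input; for the single example string in the question both answers give the same list.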

re.findall() not as greedy as expected - Python 2.7
