
Question

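I've been trying to teach myself regexes in Python, and I decided to print out all the sentences of a text. I've been tinkering with the regular expression for the past 3 hours, to no avail.

I just tried the following, but it doesn't do anything.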

import re

p = open('anan.txt')
process = p.read()
# This is the pattern the question is about; it does not find the sentences
regexMatch = re.findall(r'^[A-Z].+\s+[.!?]$', process, re.I)
print(regexMatch)
p.close()

My input file looks like this:

OMG is this a question ! Is this a sentence ? My.
name is.

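This prints no output. But when I remove "My. name is.", it prints OMG is this a question ! Is this a sentence ?, as if it only reads the first line.

What is the best regex to find all the sentences in a text file, regardless of whether a sentence runs onto a new line, or otherwise to read the whole text? Thanks.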

Answer 1

Something like this works:

## pattern: Uppercase, then anything that is not in (.!?), then one of them
>>> import re
>>> pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
>>> pat.findall('OMG is this a question ! Is this a sentence ? My. name is.')
['OMG is this a question !', 'Is this a sentence ?', 'My.']

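Notice that name is. is not in the result, because it does not start with an uppercase letter.

Your problem comes from the use of the ^ and $ anchors: without re.MULTILINE they apply to the whole text, not to each line.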
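
To illustrate how the anchors behave, here is a small sketch. The sample string and variable names are made up for illustration and are not from the question:

```python
import re

# Hypothetical two-line sample text
text = "First line.\nSecond line."

# Without re.MULTILINE, ^ matches only at the very start of the whole string
without_flag = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after every newline
with_flag = re.findall(r"^\w+", text, re.MULTILINE)

print(without_flag)  # ['First']
print(with_flag)     # ['First', 'Second']
```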

Answer 2

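Your regex has two problems:

1. Your expression is anchored with ^ and $, the "start of line" and "end of line" anchors respectively. This means your pattern is trying to match an entire line of your text.
2. You are searching for \s+ before your punctuation character, which specifies one or more whitespace characters. If there is no whitespace before the punctuation, the expression will not match.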
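
The second point can be checked with a short sketch. The `relaxed` pattern and the two sample strings below are assumptions added for comparison, not something proposed in this answer:

```python
import re

# The question's pattern without the anchors: whitespace is required before the punctuation
strict = re.compile(r"[A-Z].+\s+[.!?]")
# Assumed variant for comparison: whitespace before the punctuation is optional
relaxed = re.compile(r"[A-Z].+?\s*[.!?]")

spaced = "Is this a sentence ?"
unspaced = "Is this a sentence?"

print(strict.findall(spaced))     # ['Is this a sentence ?']
print(strict.findall(unspaced))   # []
print(relaxed.findall(unspaced))  # ['Is this a sentence?']
```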

Answer 3

Edited: now it works with multi-line sentences too.

>>> t = "OMG is this a question ! Is this a sentence ? My\n name is."
>>> re.findall(r"[A-Z].*?[.!?]", t, re.MULTILINE | re.DOTALL)
['OMG is this a question !', 'Is this a sentence ?', 'My\n name is.']

Only one thing needs explaining: re.DOTALL makes . match newlines as well (see the re.DOTALL entry in the Python re documentation).
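
A quick sketch of the difference the flag makes, using a made-up two-line string. Note that re.MULTILINE only changes how ^ and $ behave, so for a pattern without those anchors re.DOTALL is the flag doing the work:

```python
import re

# Hypothetical sentence that continues on a second line
t = "My\n name is."

# Without re.DOTALL, . stops at the newline, so the sentence is never completed
no_dotall = re.findall(r"[A-Z].*?[.!?]", t)
# With re.DOTALL, . also matches the newline
dotall = re.findall(r"[A-Z].*?[.!?]", t, re.DOTALL)

print(no_dotall)  # []
print(dotall)     # ['My\n name is.']
```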

Answer 4

I tried it in Notepad++ and I got this:

.*$

And activate the multiline option:

re.MULTILINE

Cheers

Answer 5

Thanks to cji and Jochen Ritzel.

sentence = re.compile(r"[A-Z].*?[.!?] ", re.MULTILINE | re.DOTALL)

I think this is the best; it just adds a space at the end of the pattern.

SampleReport = 'I image from 08/25 through 12. The patient image 1.2, 23, 34, 45 and 64 from serise 34. image look good to have a tumor in this area.  It has been resected during the interval between scans.  The'

If you use

pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)
pat.findall(SampleReport)

the result will be:

['I image from 08/25 through 12.',
'The patient image 1.',
 'It has been resected during the interval between scans.']

The flaw is that it cannot handle numbers like 1.2. But this one works perfectly:

sentence.findall(SampleReport)

Result:

['I image from 08/25 through 12. ',
'The patient image 1.2, 23, 34, 45 and 64 from serise 34. ',
 'It has been resected during the interval between scans. ']
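
One caveat with requiring a literal trailing space is that a sentence at the very end of the text is skipped. A possible variation, which is an assumption on my part and not something from the answers above, is to use a lookahead instead, so the punctuation only has to be followed by whitespace or the end of the text:

```python
import re

# The pattern from this answer: punctuation must be followed by a literal space
trailing_space = re.compile(r"[A-Z].*?[.!?] ", re.DOTALL)
# Assumed variant: punctuation must be followed by whitespace or the end of the text
sentence_end = re.compile(r"[A-Z].*?[.!?](?=\s|$)", re.DOTALL)

# Hypothetical sample with a decimal number and a final sentence
text = "Version 1.2 is out. It works."

print(trailing_space.findall(text))  # ['Version 1.2 is out. ']
print(sentence_end.findall(text))    # ['Version 1.2 is out.', 'It works.']
```

The lookahead also keeps the trailing space out of each match.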

Answer 6

Try the other way around: split the text at sentence boundaries.

lines = re.split(r'\s*[!?.]\s*', text)

If that doesn't work, add a \ before the ..
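
As a rough sketch of what the split produces on the sample text from the question (the variable names are made up for illustration). Inside a character class such as [!?.] the . is already literal, so the extra backslash should not normally be needed:

```python
import re

text = "OMG is this a question ! Is this a sentence ? My. name is."

# Split on punctuation together with any surrounding whitespace
parts = re.split(r"\s*[!?.]\s*", text)
# The text ends with a delimiter, so the last piece is an empty string; drop it
lines = [part for part in parts if part]

print(parts)  # ['OMG is this a question', 'Is this a sentence', 'My', 'name is', '']
print(lines)  # ['OMG is this a question', 'Is this a sentence', 'My', 'name is']
```

Note that splitting discards the punctuation itself, and unlike the earlier patterns it does not require a sentence to start with an uppercase letter.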

Answer 7

You can try:

import re

p = open('a')
process = p.read()
print(process)
# One or more non-delimiter characters followed by a sentence delimiter
regexMatch = re.findall('[^.!?]+[.!?]', process)
print(regexMatch)
p.close()

The regex used here is [^.!?]+[.!?], which tries to match one or more non-sentence-delimiter characters followed by a sentence delimiter.
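
A small sketch of this pattern applied to the question's input, given here as a string instead of a file. Each match keeps the whitespace that preceded it, so the `strip()` step is an addition of mine, not part of the answer:

```python
import re

# The question's sample input, including the line break after "My."
text = "OMG is this a question ! Is this a sentence ? My.\nname is."

raw = re.findall(r"[^.!?]+[.!?]", text)
# Remove the leading whitespace or newline left over from the previous sentence
sentences = [s.strip() for s in raw]

print(raw)        # ['OMG is this a question !', ' Is this a sentence ?', ' My.', '\nname is.']
print(sentences)  # ['OMG is this a question !', 'Is this a sentence ?', 'My.', 'name is.']
```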

Regex to find all sentences of a text?

7 answers:
