Splitting text into paragraphs

Time: 2018-11-16 23:57:23

Tags: regex string python-3.x nlp

I know I can use something like

theText='She loves music. Her favorit instrument is the piano.\n\n However, \n\n she does not play it.'
paragraphs = [p for p in theText.split('\n\n') if p]
for i,p in enumerate(paragraphs):
    print(i,p)

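to split the text into paragraphs.

However, I would like to add a further condition: a new paragraph must not start with a lowercase letter. The current code gives
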

0 She loves music. Her favorit instrument is the piano.
1  However, 
2  she does not play it.

I would like

0 She loves music. Her favorit instrument is the piano.
1  However, she does not play it.

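I believe I should use a regular expression, but I cannot find the right pattern.

1 answer:

Answer 0: (score: 1)

You can use the following regular expression, which uses a lookahead, (?=...), to make sure that \n\n is followed by an uppercase letter (optionally preceded by one whitespace character). In addition, when enumerating you have to get rid of any \n\n left inside a paragraph (re.sub is used here):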

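import re
# theText is the string defined in the question.
# Raw strings keep \s from being treated as an invalid string escape.
paragraphs = re.split(r'\n\n\s?(?=[A-Z])', theText)
for i, p in enumerate(paragraphs):
    print(i, re.sub(r'\n\n\s?', '', p))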

0 She loves music. Her favorit instrument is the piano.
1 However, she does not play it.
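
The same idea can be wrapped in a small reusable function. This is only a sketch: the names split_paragraphs, PARAGRAPH_BREAK and INNER_BREAK are made up for illustration, and the patterns are the ones used above. Because the lookahead is zero-width, the uppercase letter is checked but not consumed, so it stays at the start of the next paragraph.

```python
import re

# Hypothetical names; the patterns are the ones from the answer above.
# A blank line counts as a paragraph break only when it is followed
# (optionally after one whitespace character) by an uppercase ASCII letter.
PARAGRAPH_BREAK = re.compile(r'\n\n\s?(?=[A-Z])')
# Blank-line breaks that remain inside a paragraph (followed by lowercase text).
INNER_BREAK = re.compile(r'\n\n\s?')

def split_paragraphs(text):
    """Split text at blank lines that are followed by an uppercase letter."""
    parts = PARAGRAPH_BREAK.split(text)
    # Remove the remaining inner breaks and skip empty pieces.
    return [INNER_BREAK.sub('', p) for p in parts if p]

theText = 'She loves music. Her favorit instrument is the piano.\n\n However, \n\n she does not play it.'
paragraphs = split_paragraphs(theText)
for i, p in enumerate(paragraphs):
    print(i, p)
```

Note that [A-Z] only matches ASCII capitals, so a paragraph starting with an accented capital, a digit or a quotation mark would not be treated as a new paragraph. Inner breaks are also removed without inserting a space, which works here only because the sample text already has a space before the break.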