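Bit of a Python/programming newbie here...
I'm trying to come up with a regex that can extract the sentences from a line in a text file and then append them to a list. The code: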
import re
txt_list = []
with open('sample.txt', 'r') as txt:
    patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'
    read_txt = txt.readlines()
    for line in read_txt:
        if line == "\n":
            txt_list.append("\n")
        else:
            found = re.findall(patt, line)
            for f in found:
                txt_list.append(f)
for line in txt_list:
    if line == "\n":
        print("newline")
    else:
        print(line)
Printed output, per the last 5 lines of the code above:
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
newline
I am the {very last|last} sentence for this {instance|example}.
Contents of 'sample.txt':
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
I am the {very last|last} sentence for this {instance|example}.
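I've been playing with the regex for a couple of hours now and I can't seem to crack it. As it stands, the regex does not match at the end of "for lunch?". As a result, the two sentences "What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said." are not split apart, and splitting them is what I want.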
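A minimal sketch of why the split fails, assuming the cause is the greedy quantifiers in the pattern. The string named rest is simply the second output line from above; the variable names are only for illustration.

```python
import re

# Pattern from the question, unchanged.
patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'

rest = ("What {will|shall|should} we {eat|have} for lunch? "
        "Peas by the {thousand|hundred|1000} said Dr. Munchauson; "
        "{that|is} what he said.\n")

# The first alternative needs a '}' directly followed by . ! or ?, which
# never occurs in this text, so the second alternative is used. Its greedy
# '.*}' runs to the last '}' on the line and '.+' then runs to the last
# terminator, swallowing "for lunch?" along the way.
found = re.findall(patt, rest)
print(len(found))  # 1: both sentences come back as a single match
```

Nothing in the pattern tells the engine to stop at the first terminator after a '}', so it always prefers the longest stretch it can find.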
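Some important details for the regex:
Dr. will always come before the last pair of curly braces in each sentence. That is why I tried to build my regex around '}'. This way I can avoid the exception approach, i.e. creating exceptions for abbreviations such as Dr., Jr., approx. and so on. For every file I run this code on, I personally make sure there is no "misleading period" after the last '}' in any sentence. My desired output is: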
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
newline
I am the {very last|last} sentence for this {instance|example}.
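Answer 0 (score: 2)
The most intuitive solution I came up with is this one. Essentially, you need to treat the Dr. and Mr. tokens as atoms in their own right.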
patt = r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?'
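Broken down, it says:
Find the smallest number of Mr.s, Dr.s, or any other characters up to a punctuation mark (., ! or ?), followed by zero or one whitespace character, followed by zero or one newline.
When used on this sample.txt (I added a line):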
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
But there are no {misters|doctors} here good sir! Help us if there is an emergency.
I am the {very last|last} sentence for this {instance|example}.
It gives:
{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.
newline
But there are no {misters|doctors} here good sir!
Help us if there is an emergency.
newline
I am the {very last|last} sentence for this {instance|example}.
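For reference, a self-contained sketch of the pattern above applied to the first line of sample.txt, using an in-memory string instead of the file. The helper name split_sentences and the strip() call are additions for illustration, not part of the original answer.

```python
import re

# Pattern from the answer above: "Dr." and "Mr." are tried first as whole
# units, then any single character; the lazy '*?' stops at the first
# . ! or ? that is not part of one of those units.
PATT = re.compile(r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?')

def split_sentences(line):
    """Return the sentences found in one line, surrounding whitespace removed."""
    return [s.strip() for s in PATT.findall(line)]

sample = ("{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! "
          "What {will|shall|should} we {eat|have} for lunch? "
          "Peas by the {thousand|hundred|1000} said Dr. Munchauson; "
          "{that|is} what he said.\n")

for sentence in split_sentences(sample):
    print(sentence)
```

Other abbreviations mentioned in the question, such as Jr. or approx., would have to be added to the alternation in the same way, since the pattern only knows about the tokens listed in it.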
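Answer 1 (score: 2)
If you don't mind adding a dependency, the NLTK library has a sent_tokenize function that should do what you need, although I'm not entirely sure whether the curly braces will interfere.
The paper describing the method NLTK uses is over 40 pages long. Detecting sentence boundaries is not a trivial task.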
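A rough sketch of what that could look like, assuming NLTK is installed (pip install nltk) and its Punkt tokenizer data has been downloaded. The wrapper name nltk_sentences is hypothetical, and how the tokenizer treats the curly braces is untested here.

```python
def nltk_sentences(text):
    """Split text into sentences with NLTK's sent_tokenize.

    The import is done inside the function so that this file can still be
    loaded on a machine without NLTK; the call itself needs the library
    and its Punkt data.
    """
    from nltk.tokenize import sent_tokenize  # third-party dependency
    return sent_tokenize(text)
```

Calling nltk_sentences(line) on each line read from sample.txt would replace the re.findall step. The Punkt data is fetched once with nltk.download('punkt'); newer NLTK releases name the resource 'punkt_tab' instead.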