Question

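Python/programming newbie here...

I'm trying to come up with a regex that extracts the sentences from a line of a text file and then appends them to a list. The code: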

import re

txt_list = []

with open('sample.txt', 'r') as txt:
    patt = r'.*}[.!?]\s?\n?|.*}.+[.!?]\s?\n?'
    read_txt = txt.readlines()

    for line in read_txt:
        if line == "\n":
            txt_list.append("\n")
        else: 
            found = re.findall(patt, line)
            for f in found:
                txt_list.append(f)


# print(...) with a single argument works on both Python 2 and Python 3
for line in txt_list:
    if line == "\n":
        print("newline")
    else:
        print(line)

Output printed by the last 5 lines of the code above:

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! 
What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
I am the {very last|last} sentence for this {instance|example}.

Contents of 'sample.txt':

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

I am the {very last|last} sentence for this {instance|example}.

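I've been playing with the regex for a few hours now and I can't seem to crack it. As it stands, the regex does not stop matching at the end of for lunch?. As a result, the two sentences What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said. are not split apart, and splitting them is what I want.

Some important details for the regex:

Every sentence always ends with a period, an exclamation mark, or a question mark.
Every sentence always contains at least one pair of curly braces '{}' with some words inside. There will also never be a misleading '.' after the last pair of braces in a sentence, so Dr. will always come before the last pair of curly braces in each sentence. This is why I tried to build my regex around '}'. That way I can avoid the exception approach, where I would have to list exceptions for abbreviations such as Dr., Jr., approx., and so on. For every file I run this code on, I personally make sure there is no "misleading period" after the last '}' in any sentence.

The output I want is: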

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! 
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
I am the {very last|last} sentence for this {instance|example}.

Answer 1

The most intuitive solution I came up with is this one. Basically, you need to treat the Dr. and Mr. tokens as atoms in their own right.

patt = r'(?:Dr\.|Mr\.|.)*?[.!?]\s?\n?'

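Broken down, it says:

Find the smallest possible run of Mr. tokens, Dr. tokens, or arbitrary characters up to a terminating punctuation mark (., ! or ?), followed by zero or one whitespace character, followed by zero or one newline.

When used on this sample.txt (I added one line):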

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! What {will|shall|should} we {eat|have} for lunch? Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

But there are no {misters|doctors} here good sir! Help us if there is an emergency.

I am the {very last|last} sentence for this {instance|example}.

It gives:

{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}!
What {will|shall|should} we {eat|have} for lunch?
Peas by the {thousand|hundred|1000} said Dr. Munchauson; {that|is} what he said.

newline
But there are no {misters|doctors} here good sir!
Help us if there is an emergency.

newline
I am the {very last|last} sentence for this {instance|example}.
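
To try the pattern without a file, here is a minimal sketch that runs it on one line of the sample inlined as a string. The build_pattern helper and its abbreviations argument are my own hypothetical additions, not part of the original code; they only show one way the same idea might extend to other abbreviations such as Jr. or approx., assuming you are willing to list them.

```python
import re


def build_pattern(abbreviations):
    # Hypothetical helper: each abbreviation is escaped and tried before the
    # catch-all '.', so its period is consumed as part of the token and is
    # never tested as a sentence terminator.
    tokens = "|".join(re.escape(a) for a in abbreviations)
    return r'(?:%s|.)*?[.!?]\s?\n?' % tokens


# With these two abbreviations the helper rebuilds the pattern shown above.
patt = build_pattern(["Dr.", "Mr."])

# First line of the question's sample.txt, inlined so no file is needed.
line = ("{Hello there|Hello|Howdy} Dr. Munchauson you {gentleman|fine fellow}! "
        "What {will|shall|should} we {eat|have} for lunch? "
        "Peas by the {thousand|hundred|1000} said Dr. Munchauson; "
        "{that|is} what he said.\n")

# findall returns whole matches here because the group is non-capturing;
# strip() drops the trailing space or newline that the pattern also consumes.
sentences = [s.strip() for s in re.findall(patt, line)]

for s in sentences:
    print(s)
```

Because the quantifier is lazy, each match stops at the first terminator that is not part of a listed abbreviation. An abbreviation that is missing from the list would still split a sentence early.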

Answer 2

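If you don't mind adding a dependency, the NLTK library has a sent_tokenize function that does what you need, although I'm not entirely sure whether the curly braces will interfere.

The paper describing the method NLTK uses is more than 40 pages long. Detecting sentence boundaries is not a trivial task.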

Python: extracting sentences from a line - regex needed based on criteria
