Question

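I am trying to clean up some text, in this example an article. Because I get the text as a single line, I wanted to put each sentence on its own line, so I simply did this: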

content.replace(".", ".\n")

Well, it did not work. The article contains things like e.g. Dr. Taylor, Train Nr. 11512, so obviously my result looked silly.

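Does anyone have an idea what I can use to reliably filter out these "non-sentence ending" full stops from actual full stops? In this case, I could just check if the string in front of the full stop is an actual word, by checking if it contains a vowel and a consonant, I guess. But in general, I have no idea what I can do here.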

Answer 1

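I know this does not really answer your question, but if you just want to "clean up" the text so that it prints nicely, you can insert a new line after a certain number of characters instead of after each sentence: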

text = """Does anyone have an idea what i can use to reliably filter out these "non-sentence ending" full stops from actual full stops? In this case, i could just check if the string in front of the full stop is an actual word, by checking if it contains a vowel and a consonant i guess. But in general, i have no idea what i can do here."""

text = text.split(' ')
line_length = 0
index = 0

for word in text:
    if (line_length + len(word)) < 70:
        index += 1
        line_length += len(word) + 1
    else:
        text.insert(index, '\n')
        index += 2
        line_length = len(word) + 1

print(' '.join(text))

The output is:

Does anyone have an idea what i can use to reliably filter out these 
 "non-sentence ending" full stops from actual full stops? 
 In this case, i could just check if the string in front of the full 
 stop is an actual word, by checking if it contains a vowel and a consonant 
 i guess. But in general, i have no idea what i can do here.
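
As a possible alternative, the standard library's textwrap module performs the same kind of wrapping. The loop above inserts into the list while iterating over it, so a word can be visited twice, which is likely why the line lengths in the output are uneven; the leading space on each continuation line comes from joining the '\n' entries with ' '. A minimal sketch, assuming a width of 70 to match the threshold used in the loop:

```python
import textwrap

text = (
    'Does anyone have an idea what i can use to reliably filter out these '
    '"non-sentence ending" full stops from actual full stops? In this case, '
    'i could just check if the string in front of the full stop is an actual '
    'word, by checking if it contains a vowel and a consonant i guess. But in '
    'general, i have no idea what i can do here.'
)

# width is the maximum length of each output line, in characters.
wrapped = textwrap.fill(text, width=70)
print(wrapped)
```

With the default settings, textwrap.fill breaks only at whitespace and drops the whitespace at the break, so no line starts with a space.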

Answer 2

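What you are asking is not trivial, and there are many exceptions to account for. Also, without an example, we can only give broad advice.
However, there are some rules that are quick to implement and would improve your regex. I guess a regex gives you more flexibility than replace.

1. A full stop is always followed by a space, and the next sentence should start with a capital letter. So you should use a regex that takes this into account. [A-Z] matches any capital letter between A and Z.
2. List your exceptions ("Dr., Nr., Mr., Eng., PhD., Ph.D., George W. Bush", etc.) and do not replace in those cases (as brevno suggested in his comment). This may end up being too many cases, but you can catch most such exceptions by adding the following rules.

2.1. If the word before the full stop has no vowel, do not split.

2.2. If the word before the full stop has only one or two letters, do not split.

There are probably many other exceptions you need to consider, but this is what comes off the top of my head.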
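
The rules above could be sketched roughly as follows. This is only an illustration, not a complete sentence splitter: the names ABBREVIATIONS, is_sentence_end and split_sentences are made up for this example, the exception list is a placeholder to extend for your own text, and the sample sentence is invented.

```python
import re

# Hypothetical exception list (rule 2); extend it for your own text.
ABBREVIATIONS = {"Dr.", "Nr.", "Mr.", "Eng.", "PhD.", "Ph.D."}
VOWELS = set("aeiouAEIOU")


def is_sentence_end(token):
    """Apply rules 2, 2.1 and 2.2 to a token that ends with a full stop."""
    if token in ABBREVIATIONS:                  # rule 2: known exception
        return False
    letters = [c for c in token if c.isalpha()]
    if not any(c in VOWELS for c in letters):   # rule 2.1: no vowel
        return False
    if len(letters) <= 2:                       # rule 2.2: one or two letters
        return False
    return True


def split_sentences(text):
    sentences, start = [], 0
    # Rule 1: a full stop, then whitespace, then an uppercase letter.
    for match in re.finditer(r"(\S+\.)\s+(?=[A-Z])", text):
        if is_sentence_end(match.group(1)):
            sentences.append(text[start:match.end(1)])
            start = match.end()
    sentences.append(text[start:])
    return sentences


sample = "Dr. Taylor took Train Nr. 11512 to Berlin. The trip was long. Mr. Bush agreed."
for sentence in split_sentences(sample):
    print(sentence)
```

Note that rule 2.1, applied literally, also refuses to split after a token with no letters at all, such as a number at the end of a sentence, so that case would need its own rule.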

Answer 3

Try this approach:

import re

text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)

for sentence in sentences:
    print(sentence)

Output:

Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.

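First block: (?<!\w\.\w.): a negative lookbehind (?<!...) that rejects a position preceded by a word character (\w), a full stop (\.), another word character (\w) and then any single character (the final . is unescaped). This prevents a split after abbreviations such as i.e.

Second block: (?<![A-Z][a-z]\.): a negative lookbehind that rejects a position preceded by one uppercase letter ([A-Z]), one lowercase letter ([a-z]) and a full stop (\.), as in Mr. or Jr.

Third block: (?<=\.|\?): a positive lookbehind that requires the position to be preceded by a full stop (\.) or a question mark (\?).

Fourth block: (\s|[A-Z].*): this pattern matches what comes after the full stop or question mark from the third block. It matches whitespace (\s) or any sequence of characters starting with an uppercase letter ([A-Z].*). The code above uses only \s here. This block matters for splitting when the input may have no space after the full stop.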
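
To see how the blocks interact, the following sketch writes out the pattern from the code above with one block per line and tries it on a few short inputs. The name SPLIT_PATTERN and the sample strings are made up for illustration; the last two inputs show cases this pattern does not cover.

```python
import re

# The pattern from the code above, one block per line.
SPLIT_PATTERN = re.compile(
    r"(?<!\w\.\w.)"       # block 1: not preceded by something like "i.e."
    r"(?<![A-Z][a-z]\.)"  # block 2: not preceded by something like "Mr."
    r"(?<=\.|\?)"         # block 3: preceded by a full stop or question mark
    r"\s"                 # block 4: the whitespace character to split on
)

samples = [
    "i.e. he paid. Did he mind? Yes.",   # splits after "paid." and "mind?"
    "Dr. Taylor took Train Nr. 11512.",  # "Dr." and "Nr." are protected
    "Prof. Smith left.",                 # "Prof." is not: four letters
    "Stop! Now.",                        # "!" is not a split character
]

for sample in samples:
    print(SPLIT_PATTERN.split(sample))
```

Longer titles and exclamation marks would therefore need additional lookbehinds or an explicit exception list.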

Python: distinguishing sentence endings from other full stops
