Question

I wrote the following script to count the number of sentences in a text file:

import re

filepath = 'sample_text_with_ellipsis.txt'

with open(filepath, 'r') as f:
    read_data = f.read()

sentences = re.split(r'[.{1}!?]+', read_data.replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)

However, if I run it on a sample_text_with_ellipsis.txt that contains the following:

Wait for it... awesome!

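I get sentence_count = 2 instead of 1, because it does not ignore the ellipsis (i.e. the "...").

What I was trying to do in the regex was to make it match only a single occurrence of a period via .{1}, but this apparently does not work the way I intended. How can I get the regex to ignore ellipses?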

Answer 1

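Splitting sentences with a regex like this is not enough. See Python split text on sentences for how NLTK can be leveraged for this.

To answer your question: what you call an ellipsis is a sequence of 3 dots. So you need to use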

[!?]+|(?<!\.)\.(?!\.)

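See the regex demo. The . has been moved out of the character class because quantifiers cannot be used inside one, and a . is now matched only if it is not enclosed by other dots.

[!?]+ - 1 or more ! or ?
| - or
(?<!\.)\.(?!\.) - a dot that is neither preceded ((?<!\.)) nor followed ((?!\.)) by another dot.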
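
As a small illustration of the two points above, the sketch below first shows that {, 1 and } inside a character class are treated as literal characters rather than as a quantifier, and then lists the positions where the lookaround pattern finds a lone dot. The sample strings and the variable names literal_split and lone_dots are made up for this example.

```python
import re

# Inside [...] the characters {, 1 and } are plain literals, so the
# original pattern also splits on them instead of limiting the dot count.
literal_split = re.split(r'[.{1}!?]+', 'a{1}b')
print(literal_split)  # ['a', 'b']

# A dot matches only when no other dot sits directly before or after it,
# so the three dots of the ellipsis are skipped.
sample = 'One. Two... Three.'
lone_dots = [m.start() for m in re.finditer(r'(?<!\.)\.(?!\.)', sample)]
print(lone_dots)  # [3, 17]
```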

See the Python demo:

import re
sentences = re.split(r'[!?]+|(?<!\.)\.(?!\.)', "Wait for it... awesome!".replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
print(sentence_count)  # => 1
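
The demo trims the last element with sentences[:-1], which assumes the text always ends with a terminator. A possible variation, sketched here with a hypothetical helper named count_sentences, filters out empty pieces instead. It also replaces newlines with a space rather than an empty string, which is an assumption on my part and not something the original script does.

```python
import re

# Same pattern as in the answer: runs of ! or ?, or a lone dot.
TERMINATOR = re.compile(r'[!?]+|(?<!\.)\.(?!\.)')

def count_sentences(text):
    """Count the non-empty pieces left after splitting on terminators."""
    pieces = TERMINATOR.split(text.replace('\n', ' '))
    # Filtering blanks replaces the sentences[:-1] trim, so text that does
    # not end with a terminator still counts its last fragment.
    return sum(1 for piece in pieces if piece.strip())

print(count_sentences("Wait for it... awesome!"))  # 1
print(count_sentences("Hi. How are you? Fine!"))   # 3
print(count_sentences("No terminator here"))       # 1
```

This is still only a regex heuristic: abbreviations such as "e.g." are split at each dot, and an ellipsis that actually ends a sentence is not treated as a boundary.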

Answer 2

Following Wiktor's suggestion to use NLTK, I also came up with the following alternative solution:

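import nltk

# sent_tokenize relies on NLTK's Punkt models; if they are missing, download them once
# with nltk.download('punkt') (recent NLTK releases may ask for 'punkt_tab' instead).
read_data = "Wait for it... awesome!"
sentence_count = len(nltk.tokenize.sent_tokenize(read_data))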

This yields the expected sentence count of 1.

How to count sentences taking into account the occurrence of ellipsis
