Question

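I have to cut a unicode string which is actually an article (it contains sentences). I want to cut this article string after the Xth sentence in Python.

A good indicator that a sentence ends is a full stop (".") followed by a word that starts with a capital letter. For example: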

myarticle = "Hi, this is my first sentence. And this is my second. Yet this is my third."

How can I achieve this?

Thanks

Answer 1

Consider downloading the Natural Language Toolkit (NLTK). Then you can create sentences that will not break on things like "U.S.A", or fail to split sentences that end in "?!".

>>> import nltk
>>> paragraph = u"Hi, this is my first sentence. And this is my second. Yet this is my third."
>>> sentences = nltk.sent_tokenize(paragraph)
>>> sentences
[u"Hi, this is my first sentence.", u"And this is my second.", u"Yet this is my third."]

Your code becomes much more readable. To access the second sentence, use the notation you are used to.

>>> sentences[1]
u"And this is my second."
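To get from a list of sentences back to the article cut after the Xth sentence, one option is to locate each sentence in the original string and slice at the end of the last one, which keeps the original whitespace. The following is only a sketch: cut_after_sentences and naive_tokenize are hypothetical names, and naive_tokenize is a crude stand-in so the example runs without NLTK installed. In practice you would pass nltk.sent_tokenize as the tokenize argument.

```python
import re

def cut_after_sentences(text, n, tokenize):
    """Return text truncated right after its n-th sentence.

    tokenize is any callable that maps a string to a list of sentence
    strings, for example nltk.sent_tokenize.
    """
    end = 0
    for sentence in tokenize(text)[:n]:
        start = text.find(sentence, end)
        if start == -1:
            # The tokenizer changed the sentence text; stop at the last match.
            break
        end = start + len(sentence)
    return text[:end]

def naive_tokenize(text):
    # Hypothetical stand-in for nltk.sent_tokenize (much less accurate).
    return [s.strip() for s in re.findall(r'[^.!?]+[.!?]+', text)]

paragraph = "Hi, this is my first sentence. And this is my second. Yet this is my third."
print(cut_after_sentences(paragraph, 2, naive_tokenize))
# Hi, this is my first sentence. And this is my second.
```

If n is larger than the number of sentences, the whole text up to the end of the last sentence is returned.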

Answer 2

Here is a more robust solution:

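import re

myarticle = """This is a sentence.
   And another one.
   And a 3rd one."""

N = 3  # 3 sentences

# Split on a period that is followed by a capital letter or the end of the text,
# then drop the remainder and put the periods back.
print(''.join(sentence + '.' for sentence in re.split(r'\.(?=\s*(?:[A-Z]|$))', myarticle, maxsplit=N)[:-1]))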

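Compared with some of the other possibilities mentioned before, this solution has a few advantages:

It works even if the text contains exactly N sentences. Some of the other answers produce a double . at the end in that case. This is avoided by taking into account that the last sentence is not followed by a capital letter but by the end of the text ($).
It also works if the text contains fewer than N sentences.
The number of splits is bounded by the maxsplit parameter of re.split(), so no more splitting is done than necessary, which makes it efficient.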
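The three points above can be checked with a small wrapper. This is a sketch only: first_sentences and SENTENCE_END are hypothetical names, and the guard for n <= 0 is an addition, needed because re.split treats maxsplit=0 as "no limit" rather than "no splits".

```python
import re

# Same lookahead as in the answer: a period followed by optional whitespace
# and then either a capital letter or the end of the text.
SENTENCE_END = re.compile(r'\.(?=\s*(?:[A-Z]|$))')

def first_sentences(text, n):
    """Return the first n sentences of text, each ending in a period."""
    if n <= 0:
        # maxsplit=0 would mean "split everywhere", so handle this explicitly.
        return ''
    parts = SENTENCE_END.split(text, maxsplit=n)
    # The last element is the unsplit remainder (or '' at the end of the text).
    return ''.join(part + '.' for part in parts[:-1])

article = "This is a sentence. And another one. And a 3rd one."
print(first_sentences(article, 2))             # This is a sentence. And another one.
print(first_sentences(article, 3) == article)  # True (exactly N sentences)
print(first_sentences(article, 5) == article)  # True (fewer than N sentences)
```

Note that a trailing fragment without a final period is dropped, because it is always treated as the remainder.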

Hope this helps!

Answer 3

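If punctuation other than the usual '.' may occur, you should try this:

re.split(r'\W(?=[A-Z])', ss)

This returns a list of sentences. Of course, it does not handle the cases mentioned by Paul correctly.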
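A short illustration of what this pattern does and where it breaks, assuming ss holds plain English text. The wrapper name split_before_capitals is hypothetical. The pattern removes the single non-word character directly before each capital letter, which is usually the space after the punctuation, so the punctuation itself stays attached to the sentence.

```python
import re

def split_before_capitals(ss):
    # Split at any non-word character that is directly followed by A-Z.
    return re.split(r'\W(?=[A-Z])', ss)

print(split_before_capitals("Is this it? Yes! It is."))
# ['Is this it?', 'Yes!', 'It is.']

# Abbreviations and capitalised words in mid-sentence are split as well:
print(split_before_capitals("I live in the U.S.A now."))
# ['I live in the', 'U', 'S', 'A now.']
```

For clean text, the first X sentences can then be rejoined with ' '.join(parts[:X]).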

Answer 4

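Try this:

'.'.join(re.split(r'\.(?=\s*[A-Z])', myarticle)[:2]) + '.'

This cuts your string after the second sentence ([:2]).

However, there are some problems (as always when you deal with natural language): most notably, it only recognizes a sentence if the next one starts with 'A-Z'. This may work for English, but not for other languages.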
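One possible way around the 'A-Z' limitation is to test the character after the period with str.isupper(), which also recognizes uppercase letters outside ASCII. This is a sketch of a different technique than the one-liner above, not a replacement for a real sentence tokenizer; cut_after_nth_sentence is a hypothetical name, and abbreviations followed by a capitalized word will still be counted as sentence ends.

```python
import re

def cut_after_nth_sentence(text, n):
    """Cut text after the n-th period that is followed by an uppercase
    letter (any script) or by the end of the text."""
    if n <= 0:
        return ''
    count = 0
    for m in re.finditer(r'\.\s*', text):
        next_char = text[m.end():m.end() + 1]  # '' at the end of the text
        if next_char == '' or next_char.isupper():
            count += 1
            if count == n:
                return text[:m.start() + 1]  # keep the period itself
    return text  # fewer than n sentences: return everything

print(cut_after_nth_sentence("Das ist gut. Über uns. Ende.", 2))
# Das ist gut. Über uns.
```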
