Question

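I am using nltk PunktSentenceTokenizer() in Python to segment text into sentences. However, many long sentences contain enumerations, and in those cases I need to get the sub-sentences.

Example: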

The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.

The desired output is:

"The api allows the user to achieve following goals aXXXXX.", "The api allows the user to achieve following goals bXXXXX." and "The api allows the user to achieve following goals cXXXXX."

How can I achieve this?

Answer 1

To get the sub-sentences, you could use a RegExp tokenizer.

An example of how to use it to split the sentence could look like this:

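from nltk.tokenize.regexp import regexp_tokenize

str1 = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'

# With gaps=True the pattern describes the separators, so the text is split at each "(x)" enumerator.
parts = regexp_tokenize(str1, r'\(\w\)\s*', gaps=True)

# The first piece is the shared lead-in; the remaining pieces are the enumerated items.
start_of_sentence = parts.pop(0)

for part in parts:
    print(" ".join((start_of_sentence, part)))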
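
Note that the loop above keeps the colon of the lead-in and whatever commas or periods trail each item. If you need exactly the form asked for in the question, a minimal sketch using only the standard `re` module could look like the following. The function name `expand_enumeration` is made up for illustration, and the sketch assumes single-character enumerators such as "(a)" or "(1)"; it would also split on something like "file(s)".

```python
import re

def expand_enumeration(sentence):
    """Split 'lead-in: (a) x, (b) y.' into one sentence per enumerated item."""
    # Split at each single-character enumerator such as "(a)" or "(1)".
    pieces = re.split(r'\(\w\)', sentence)
    # The first piece is the shared lead-in; drop its trailing colon and spaces.
    lead_in = pieces[0].strip().rstrip(':').strip()
    results = []
    for piece in pieces[1:]:
        # Remove whitespace and the list separators left around each item.
        item = piece.strip().strip(',.;').strip()
        if item:
            results.append(f"{lead_in} {item}.")
    return results

text = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
for sub in expand_enumeration(text):
    print(sub)
```

For the example sentence this prints three lines, each starting with "The api allows the user to achieve following goals" and ending with one item followed by a period.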

Answer 2

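I'll skip over the obvious question ("What have you tried so far?"). As you may have found out already, PunktSentenceTokenizer isn't really going to help you, since it will leave your input sentence in one piece. The best solution depends heavily on how predictable your input is. The following works on your example, but as you can see it relies on there being a colon and some commas. If they are not there, it won't help you.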

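import re
from nltk import PunktSentenceTokenizer

s = 'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.'
# Punkt keeps this input as a single sentence, so it is not used for the split below.
#sents = PunktSentenceTokenizer().tokenize(s)

# Split off the lead-in at the colon, then split the enumerated part at the commas.
p = s.split(':')
for enumerated in p[1:]:
    for item in enumerated.split(','):
        # Drop the "(a)"-style marker and the surrounding whitespace.
        item = re.sub(r'\([a-z]\)', '', item).strip()
        print("%s: %s" % (p[0], item))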
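
If your input is less predictable, one option is to split on the enumerators themselves rather than on the colon and commas, and to leave a sentence untouched when it contains no list. The following is only a sketch under assumptions of my own: the name `split_if_enumerated`, the `min_items` threshold, and the accepted enumerator forms ("(a)", "(1)", "(ii)") are all illustrative and would need adjusting to your data.

```python
import re

# Assumed enumerator forms: "(a)", "(1)", "(ii)". Adjust the pattern for other styles.
ENUMERATOR = re.compile(r'\((?:[a-z]|\d{1,2}|[ivx]{1,4})\)', re.IGNORECASE)

def split_if_enumerated(sentence, min_items=2):
    """Return one sub-sentence per enumerated item, or [sentence] if no list is found."""
    pieces = ENUMERATOR.split(sentence)
    # Everything after the first piece is a candidate item; trim separators around it.
    items = [p.strip(' ,;.') for p in pieces[1:]]
    items = [i for i in items if i]
    # A single match such as "file(s)" is not treated as an enumeration.
    if len(items) < min_items:
        return [sentence]
    lead_in = pieces[0].strip().rstrip(':').rstrip()
    return ["%s: %s" % (lead_in, item) for item in items]

sentences = [
    'The api allows the user to achieve following goals: (a) aXXXXXX ,(b)bXXXX, (c) cXXXXX.',
    'Remove the file(s) now.',
]
for sentence in sentences:
    for sub in split_if_enumerated(sentence):
        print(sub)
```

The `sentences` list here is hard-coded; in practice it could come from `PunktSentenceTokenizer().tokenize(text)`, so that Punkt handles the sentence boundaries and this function only handles the enumerations inside each sentence.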

How do I segment text into sub-sentences based on enumerators?
