Question

I am new to NLTK. I am using the function sent_tokenize on two strings, and the output differs from what I expected.

1) First string

sent_tokenize("An uncle is the female sibbling of one's parents. An aunt can also be the wife of an [[uncle]] who is the male sibbling of a parent")

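Output:

["An uncle is the female sibbling of one's parents.", 'An aunt can also be the wife of an [[uncle]] who is the male sibbling of a parent']

2) Second string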

sent_tokenize("An uncle is the female [[sibbling]] of one's [[parent]]s. An aunt can also be the wife of an [[uncle]] who is the male sibbling of a parent")

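Output:

["An uncle is the female [[sibbling]] of one's [[parent]]s. An aunt can also be the wife of an [[uncle]] who is the male sibbling of a parent"]

For the second string, it does not return two sentences as it does for the first. What could be the problem?

(We could use split with "." as the delimiter to get the sentences, but I want to know what is going wrong here.)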
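For reference, the split-based fallback mentioned above might look like the sketch below. The helper name naive_split is hypothetical, and this approach drops the periods and would also break on abbreviations, so it is only a rough stand-in for a real sentence tokenizer.

```python
def naive_split(text):
    """Split on every period and discard empty pieces (hypothetical helper)."""
    return [part.strip() for part in text.split(".") if part.strip()]


second = ("An uncle is the female [[sibbling]] of one's [[parent]]s. "
          "An aunt can also be the wife of an [[uncle]] who is the male "
          "sibbling of a parent")

# Print each piece on its own line.
for sentence in naive_split(second):
    print(sentence)
```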

Answer 1

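The sent_tokenize function uses the Punkt tokenizer, which works reliably only on well-formed English text.

A period does not always mean the end of a sentence, for example in an abbreviation such as "e.g.".

Before tokenizing, make sure your text is well-formed English.

In your case, adding a single space after the "s" (before the period) or replacing [[parent]]s with parents solves the problem. If that does not suit your requirements, you will need to find a way to clean the text first.

Reading more about the Punkt tokenizer may also help.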
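One way to clean the text, as suggested above, is to strip the [[...]] markup before tokenizing. This is a minimal sketch: the names strip_wiki_brackets and tokenize_clean are hypothetical, and it assumes the double brackets are wiki-style links whose inner text should be kept.

```python
import re

# Matches wiki-style markup such as [[parent]] and captures the inner text.
WIKI_LINK = re.compile(r"\[\[(.*?)\]\]")


def strip_wiki_brackets(text):
    """Replace every [[word]] with word, leaving the rest untouched."""
    return WIKI_LINK.sub(r"\1", text)


def tokenize_clean(text):
    """Strip the markup, then hand the text to NLTK's sent_tokenize."""
    # Imported here so the cleaning helper works without NLTK installed.
    # Requires NLTK and its Punkt data to be available.
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(strip_wiki_brackets(text))


raw = ("An uncle is the female [[sibbling]] of one's [[parent]]s. "
       "An aunt can also be the wife of an [[uncle]] who is the male "
       "sibbling of a parent")

print(strip_wiki_brackets(raw))
```

Calling tokenize_clean(raw) then passes text containing "parents." rather than "[[parent]]s." to the tokenizer, which is the replacement this answer describes.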

Answer 2

There is nothing wrong with your sentence, but I think you can write the second string as shown below, and it works fine:

sentence2 = "An uncle is the female [[sibbling]] of one's [[parent]]s . An aunt can also be the wife of an [[uncle]] who is the male sibbling of a parent"

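It also works fine if you write the sentence without the square brackets. I don't know why you are using square brackets.

On my GitHub (https://github.com/rameshjesswani/Semantic-Textual-Similarity/blob/master/nlp_basics/nltk/text_normalization.ipynb) I have done some basic text processing with NLTK; you can take a look.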
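If the brackets have to stay, the space-before-period workaround can be applied programmatically. The sketch below is only an illustration: space_before_period is a hypothetical helper, and the pattern assumes the problem only occurs when a period directly follows "]]" plus optional trailing letters.

```python
import re


def space_before_period(text):
    """Insert a space between '...]]' (plus trailing letters) and a period."""
    return re.sub(r"(\]\]\w*)\.", r"\1 .", text)


original = ("An uncle is the female [[sibbling]] of one's [[parent]]s. "
            "An aunt can also be the wife of an [[uncle]] who is the male "
            "sibbling of a parent")

# Produces the same string as sentence2 above.
print(space_before_period(original))
```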

sent_tokenize is not working properly
