Question

Say we have a collection of documents:

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

How can I find, for each unique word, the number of sentences it appears in?

I tried:

from collections import Counter
count = dict(Counter(word for sentence in corpus for word in sentence.split()))

The result I get is:

{'this': 4, 'is': 4, 'the': 4, 'first': 2, 'document': 4, 'second': 1, 'and': 1, 'third': 1, 'one': 1}

I am looking for output where the key 'document' has the value 3 instead of 4, because it appears in 3 of the 4 sentences.

Answer 1

If you want each word to be counted only once per sentence, you can convert the words of each sentence to a set before counting:

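from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

# set() drops repeated words within a sentence, so each sentence adds at most 1 per word
count = dict(Counter(word for sentence in corpus for word in set(sentence.split())))

With this change, count['document'] is 3 instead of 4.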
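For reuse, the same set-based idea can be wrapped in a small helper. This is only a minimal sketch: the function name document_frequency is made up for illustration and does not come from the question or the answer, and it assumes that words are separated by whitespace, as in the example corpus.

```python
from collections import Counter


def document_frequency(sentences):
    """Count, for each word, how many sentences contain it at least once.

    Hypothetical helper for illustration; assumes whitespace-separated words.
    """
    counts = Counter()
    for sentence in sentences:
        # set() removes repeats within one sentence, so each word
        # contributes at most 1 per sentence.
        counts.update(set(sentence.split()))
    return dict(counts)


corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]

count = document_frequency(corpus)
print(count['document'])  # 3
```

The order of keys in the resulting dict may vary between runs, since sets are unordered, but the counts themselves do not.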
