How do I find the number of unique terms in a set of documents?

Time: 2020-09-27 16:45:47

Tags: python-3.x

Say we have a set of documents:

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

How can I count each word at most once per sentence, that is, find the number of sentences each word appears in?

I used:

count = dict(Counter(word for sentence in corpus for word in sentence.split()))

The result I get is:

{'this': 4, 'is': 4, 'the': 4, 'first': 2, 'document': 4, 'second': 1, 'and': 1, 'third': 1, 'one': 1}

I am looking for output in which the key 'document' has the value 3 rather than 4, because it appears in 3 of the 4 sentences.

1 answer:

Answer 0 (score: 0)

If you want each word to be counted only once per sentence, you can convert the words of each sentence to a set before counting:

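from collections import Counter

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

# set() removes repeated words within a sentence, so each word
# contributes at most 1 per sentence
count = dict(Counter(word for sentence in corpus for word in set(sentence.split())))

With this, count['document'] is 3 rather than 4; the other counts are unchanged, because no other word repeats within a sentence.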
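The same idea can also be written as an explicit loop, which may be easier to read or extend. This is only a minimal sketch: the function name document_frequency is a hypothetical helper introduced here for illustration, not something from the question, and it assumes that splitting on whitespace is an adequate way to tokenize the sentences.

```python
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]


def document_frequency(sentences):
    """Return a dict mapping each word to the number of sentences containing it.

    Hypothetical helper for illustration; tokenizes with str.split().
    """
    counts = Counter()
    for sentence in sentences:
        # set() keeps one copy of each word, so a word repeated inside
        # a single sentence is still counted only once for that sentence
        counts.update(set(sentence.split()))
    return dict(counts)


count = document_frequency(corpus)
print(count['document'])  # 3
```

Because the words pass through a set, the order of the keys in the resulting dict is not guaranteed; sort the items if you need a stable order.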