Say we have a collection of documents:
corpus = [
'this is the first document',
'this document is the second document',
'and this is the third one',
'is this the first document',
]
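How can I count each unique word only once per sentence, that is, find the number of sentences each word appears in?
I used: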
count = dict(Counter(word for sentence in corpus for word in sentence.split()))
The result I get is:
{'this': 4, 'is': 4, 'the': 4, 'first': 2, 'document': 4, 'second': 1, 'and': 1, 'third': 1, 'one': 1}
I am looking for an output where the key 'document' has the value 3 instead of 4, because it appears in 3 of the 4 sentences.
Answer 0 (score: 0)
If you want each word to be counted only once per sentence, you can convert the words of each sentence to a set before counting:
from collections import Counter
corpus = [
'this is the first document',
'this document is the second document',
'and this is the third one',
'is this the first document',
]
count = dict(Counter(word for sentence in corpus for word in set(sentence.split())))
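count would then be (key order may vary, since sets are unordered):
{'this': 4, 'is': 4, 'the': 4, 'first': 2, 'document': 3, 'second': 1, 'and': 1, 'third': 1, 'one': 1}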
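As a cross-check, here is a minimal sketch of the same idea written as an explicit loop, which may be easier to read than the nested generator. The helper name `document_frequency` is hypothetical, chosen only for illustration; it assumes words are separated by whitespace and does no lowercasing or punctuation handling.

```python
from collections import Counter

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]


def document_frequency(sentences):
    """Count, for each word, how many sentences contain it at least once.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for sentence in sentences:
        # set() removes repeats within one sentence, so each word
        # contributes at most 1 per sentence.
        counts.update(set(sentence.split()))
    return dict(counts)


df = document_frequency(corpus)
print(df['document'])  # prints 3: 'document' occurs in 3 of the 4 sentences
```

Looking up individual keys, as above, avoids depending on the order of the resulting dict, which can differ between runs because sets are unordered.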