Question

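I want to know the number of documents in which a particular word appears. For example, the word "dog" appears in 67 out of 100 documents.

One document corresponds to one file.

So the frequency of the word "dog" within a document does not need to be counted. For example, in document 1 "dog" appears 250 times, but that counts only once, because my goal is to count documents, not the number of times "dog" appears in a particular document.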

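Example:

Document 1: dog appears 250 times
Document 2: dog appears 1000 times
Document 3: dog appears 1 time
Document 4: dog appears 0 times
Document 5: dog appears 2 times

So the answer must be 4.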
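To make the expected result concrete, here is a minimal sketch of that count. The dictionary `occurrences` and its keys are hypothetical stand-ins for the five per-document counts listed above.

```python
# Hypothetical per-document occurrence counts of "dog", taken from the example
occurrences = {
    "doc1": 250,
    "doc2": 1000,
    "doc3": 1,
    "doc4": 0,
    "doc5": 2,
}

# A document counts once if the word appears at all, regardless of how often
document_frequency = sum(1 for count in occurrences.values() if count > 0)
print(document_frequency)  # 4
```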

I have my own algorithm, but I believe there is a more efficient way to do this. I'm using Python 3.4 and the NLTK library. I need help. Thanks!

Here is my code:

# DOCUMENT FREQUENCY
for eachadd in wordwithsource:
    for eachaddress in wordwithsource:
        if eachaddress == eachadd:
            if eachaddress not in copyadd:
                countofdocs=0
                copyadd.append(eachaddress)
                countofdocs = countofdocs+1
                addmanipulation.append(eachaddress[0])

for everyx in addmanipulation:
    documentfrequency = addmanipulation.count(everyx)
    if everyx not in otherfilter:
        otherfilter.append(everyx)
        documentfrequencylist.append([everyx,documentfrequency])

#COMPARE WORDS INTO DOC FREQUENCY 
for everywords in tempwords:
    for everydocfreq in documentfrequencylist:
        if everywords.find(everydocfreq[0]) !=-1:
            docfreqofficial.append(everydocfreq[1])

for everydocfrequency in docfreqofficial:
    docfrequency=(math.log10(numberofdocs/everydocfrequency))
    docfreqanswer.append(docfrequency)

Answer 1

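You can store a frequency dictionary for each document and use another, global dictionary to hold the document frequency of each word. For simplicity, I used Counter.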

from collections import Counter

# Use a list to simulate a document store
documents = ['This is document %d' % i for i in range(5)]

# Calculate word frequencies per document
word_frequencies = [Counter(document.split()) for document in documents]

# Calculate document frequency: each document adds at most 1 per word.
# An explicit loop is used because map() is lazy in Python 3, so
# map(document_frequencies.update, ...) would never run the updates.
document_frequencies = Counter()
for word_frequency in word_frequencies:
    document_frequencies.update(word_frequency.keys())

print(document_frequencies)

# Output (the order of entries with equal counts may vary by Python version):
# Counter({'This': 5, 'is': 5, 'document': 5, '0': 1, '1': 1, '2': 1, '3': 1, '4': 1})
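The same idea can be tied back to the question's numbers. What follows is a minimal sketch, assuming the documents are already tokenized into lists of words. The sample `documents` and the helper name `document_frequency` are made up for illustration, and the inverse document frequency uses `log10(N / df)` as in the question's code.

```python
import math
from collections import Counter

def document_frequency(tokenized_docs):
    """Return a Counter mapping each word to the number of documents containing it."""
    df = Counter()
    for tokens in tokenized_docs:
        # set() collapses repeats, so each document adds at most 1 per word
        df.update(set(tokens))
    return df

# Made-up sample data: "dog" appears in 4 of the 5 documents
documents = [
    ["dog"] * 250,
    ["dog"] * 1000 + ["cat"],
    ["the", "dog", "barks"],
    ["no", "pets", "here"],
    ["dog", "dog", "cat"],
]

df = document_frequency(documents)
print(df["dog"])  # 4

# Inverse document frequency, mirroring log10(numberofdocs / df) from the question
idf_dog = math.log10(len(documents) / df["dog"])
print(round(idf_dog, 4))  # 0.0969
```

With NLTK, the token lists could come from something like `nltk.word_tokenize` applied to each file's text; that step is left out here so the sketch depends only on the standard library.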

Answer 2

This can be done with gensim.

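from gensim import corpora

# corpus: a list of documents, each document a list of tokens
dictionary = corpora.Dictionary(corpus)
# dfs maps each token id to the number of documents containing that token;
# dictionary.token2id gives the id for a given word
dictionary.dfs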

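Here each doc is a list of tokens and corpus is a list of docs. The Dictionary instance also stores the overall collection frequencies (cfs).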

https://radimrehurek.com/gensim/corpora/dictionary.html

Document frequency in Python
