Question

I have computed tf and idf for all the terms in my documents, so I have the following objects:

1) tf dictionaries (about 10k of them):

{'doc_1': {'rain':0.4, 'sun':0.6}}
{'doc_2': {'............
{'doc_3': {'rain':0.1, .......

2) a single idf dictionary:

{'rain': 0.18, 'sun': 0.12......

3) an index list covering all terms:

[{'term1':[[doc_1, 2],[doc_2, 3]]}, {'term2': [[doc_6, 6],[doc3,1]]}

for each term

... and so on

How can I now compute tf*idf for, say, a list of words? This is what I am trying:

def tf_idf(list_of words): 
    t_id={}
    for i in list_of_words:
        score= {}
        for j in terms: 
            score[j[0]]=(idf[i]*tf[j[0]][i])
        t_id[i]=score
    return t_id

It gives me an error:

KeyError: 0

Answer 1

Here is some general programming advice:

def tf_idf(list_of_words, tf, idf):  # Pass your variables in, as opposed to using global scope.
    t_id = {}
    for word in list_of_words:  # Name your variables to avoid confusion.
        score = {}
        for term in terms:
            score[term[0]] = idf[word] * tf[term[0]][word]
        t_id[word] = score
    return t_id

I think the problem here is the reference to term[0] (which you wrote as j[0]). Going by your post, terms looks like this:

[{'term1':[[doc_1, 2],[doc_2, 3]]}...]

So term (or j) is just:

{'term1':[[doc_1, 2],[doc_2, 3]]}

When you reference term[0] (or j[0]), there would need to be an element in that dictionary with 0 as its key.

Since that key is missing, you get a KeyError.
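To make the failure concrete, here is a minimal sketch that reproduces it and shows one way to walk the index instead. The sample data is hypothetical: it only mimics the shape posted in the question, with the document ids assumed to be strings.

```python
# Hypothetical sample shaped like the index in the question.
terms = [
    {"term1": [["doc_1", 2], ["doc_2", 3]]},
    {"term2": [["doc_6", 6], ["doc_3", 1]]},
]

first_entry = terms[0]

# Indexing a dict with 0 looks up a key equal to 0, not "the first item".
try:
    first_entry[0]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Unpack each single-key dict with .items() rather than indexing by position.
pairs = []
for entry in terms:
    for term, postings in entry.items():
        for doc, count in postings:
            pairs.append((term, doc, count))
```

Each tuple in pairs then holds a term, a document id and the count for that pair, which is the information the inner loop in the question appears to need.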

Answer 2

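I agree with ers81239's programming advice. I analysed your program and I see the same errors. So the real answer to your actual question is that you are not addressing your data structures correctly.

However, to get you started, I rewrote it from scratch, trying to interpret your intent from your code.

First, the term frequencies already contain all the information. I changed them into a nested dict and then computed the idf from that. This means fewer data structures and fewer places to index incorrectly.

From there, computing TF * IDF is not hard. I left out as much as I could.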
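As a rough sketch of that conversion, the list of single-key dicts from the question could be folded into one nested dict as follows. The function name to_nested and the sample data are assumptions made for illustration; they do not appear in the original post.

```python
def to_nested(index):
    """Fold [{term: [[doc, count], ...]}, ...] into {term: {doc: count}}."""
    nested = {}
    for entry in index:
        for term, postings in entry.items():
            doc_counts = nested.setdefault(term, {})
            for doc, count in postings:
                doc_counts[doc] = count
    return nested


# Hypothetical sample shaped like the index in the question.
index = [
    {"term1": [["doc_1", 2], ["doc_2", 3]]},
    {"term2": [["doc_2", 6], ["doc_3", 1]]},
]

nested_tf = to_nested(index)
```

With this layout, nested_tf["term1"]["doc_2"] is a plain two-step lookup, and the document frequency of a term is simply len(nested_tf[term]).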

from collections import Counter
from math import log

tf = {
        'term1': {'doc_1': 2, 'doc_2': 3},
        'term2': {'doc_2': 6, 'doc_3': 1},
}

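# N is the number of distinct documents across all terms.
N = float(len({doc for doc_freqs in tf.values()
                   for doc in doc_freqs}))

print(N)

# Plain ratio N / document frequency, with no log damping applied.
idf = {term: N / len(doc_freqs)
           for term, doc_freqs in tf.items()}

print(idf)

# One score per (term, document) pair. Keying on the term alone
# would keep only the score of the last document for each term.
tf_idf = {term: {doc: freq * idf[term]
                     for doc, freq in doc_freqs.items()}
             for term, doc_freqs in tf.items()}

print(tf_idf)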

Keep in mind that this is the simplest TF * IDF imaginable, and that many refinements are usually applied.
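If you would rather keep the two dictionaries from the question, a function along these lines might do what was originally intended. It is a sketch under assumptions: the per-document tf dicts are taken to be merged into one dict keyed by document id, and the sample values (including the term "wind") are made up for illustration.

```python
def tf_idf(list_of_words, tf, idf):
    """Return {word: {doc: tf * idf}} for each requested word.

    tf  is assumed to look like {doc_id: {term: tf_value}}.
    idf is assumed to look like {term: idf_value}.
    """
    scores = {}
    for word in list_of_words:
        if word not in idf:
            continue  # Unknown word: skip it rather than raise KeyError.
        scores[word] = {
            doc: term_weights[word] * idf[word]
            for doc, term_weights in tf.items()
            if word in term_weights
        }
    return scores


# Hypothetical sample data in the shape described in the question.
tf = {
    "doc_1": {"rain": 0.4, "sun": 0.6},
    "doc_3": {"rain": 0.1, "wind": 0.9},
}
idf = {"rain": 0.18, "sun": 0.12, "wind": 0.3}

result = tf_idf(["rain", "sun", "snow"], tf, idf)
```

Documents that do not contain a word are left out of its score dict, so no lookup depends on the index list at all.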

TF-IDF using dictionaries
