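I got this tf-idf script from yebrahim, and for some reason every score in my output documents is 0. Is there something wrong with it? Example output: hippo 0.0, hip 0.0, hips 0.0, hint 0.0, hindsight 0.0, hill 0.0, hilarious 0.0
Thanks for your help.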
    # (fragment: this first part presumably runs inside a loop over all_files; the loop header is not shown)
    # increment local count
    for word in doc_words:
        if word in terms_in_doc:
            terms_in_doc[word] += 1
        else:
            terms_in_doc[word] = 1
    # increment global frequency
    for (word, freq) in terms_in_doc.items():
        if word in global_term_freq:
            global_term_freq[word] += 1
        else:
            global_term_freq[word] = 1
    global_terms_in_doc[f] = terms_in_doc

print('working through documents.. ')
for f in all_files:
    writer = open(f + '_final', 'w')
    result = []
    # iterate over terms in f, calculate their tf-idf, put in new list
    max_freq = 0
    for (term, freq) in global_terms_in_doc[f].items():
        if freq > max_freq:
            max_freq = freq
    for (term, freq) in global_terms_in_doc[f].items():
        idf = math.log(float(1 + num_docs) / float(1 + global_term_freq[term]))
        tfidf = float(freq) / float(max_freq) * float(idf)
        result.append([tfidf, term])
    # sort result on tfidf and write them in descending order
    result = sorted(result, reverse=True)
    for (tfidf, term) in result[:top_k]:
        if display_mode == 'both':
            writer.write(term + '\t' + str(tfidf) + '\n')
        else:
            writer.write(term + '\n')
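Answer 0 (score: 3)
The output of tf-idf obviously depends on counting the terms correctly. If the counts are wrong, the results will be unexpected. You may want to print the raw counts for each word to verify this. For example, how many times does the word "hipp" occur in the current document, and how many times in the whole collection?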
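A rough sketch of that check, in case it helps. The names `count_terms`, `idf`, and the sample `docs` dictionary are made up for illustration and are not part of the original script; only the idf formula is copied from the question. One thing the printout makes visible: with this formula, a term that occurs in every document gets `log(1) = 0`, so a run over a single file would give 0.0 for every term. Whether that is what happened here is only a guess.

```python
import math
from collections import Counter


def count_terms(docs):
    """docs maps a document name to its list of tokens (hypothetical input)."""
    # Term counts inside each document.
    per_doc = {name: Counter(words) for name, words in docs.items()}
    # Number of documents each term occurs in.
    doc_freq = Counter()
    for counts in per_doc.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    return per_doc, doc_freq


def idf(num_docs, df):
    # Same formula as in the question.
    return math.log(float(1 + num_docs) / float(1 + df))


# A single made-up document: every term occurs in all (one) documents.
docs = {"a.txt": ["hippo", "hill", "hippo"]}
per_doc, doc_freq = count_terms(docs)
for name, counts in per_doc.items():
    for term in counts:
        # document, term, count in document, document frequency, idf
        print(name, term, counts[term], doc_freq[term],
              idf(len(docs), doc_freq[term]))
```

If the printed counts look right but the idf column is 0.0 everywhere, the place to look is the value of `num_docs` and the set of input files rather than the counting loop.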
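Some other pointers:
- Use from __future__ import division instead of explicit floats for division. It makes your code more readable. (In Python 3, / is already true division, so the import is not needed there.)
- Avoid items(). It creates a whole new list of (key, value) pairs, which carries a heavy penalty in computation and storage. Iterate over the dictionary's keys (for k in some_dictionary) and use regular indexing to access the values (some_dictionary[k]). (This applies to Python 2; in Python 3, items() returns a lightweight view and no list is built.)

The pointers above may not solve your problem directly, but they will make your code easier to read and understand (both for you and for people on SO), which makes it easier to find and fix problems.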
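As an illustration of the two pointers, the scoring loop could be written roughly like this. This is a sketch, not the original author's code: the function name `top_terms` and its parameters are hypothetical, and it assumes Python 3, where `/` is already true division (on Python 2 you would put `from __future__ import division` at the top of the file).

```python
import math


def top_terms(term_counts, doc_freq, num_docs, top_k=10):
    """Score one document's terms with the tf-idf formula from the question.

    term_counts: term -> count in this document
    doc_freq:    term -> number of documents containing the term
    """
    if not term_counts:
        return []
    max_freq = max(term_counts.values())
    result = []
    for term in term_counts:          # iterate over the keys
        freq = term_counts[term]      # regular indexing for the value
        idf = math.log((1 + num_docs) / (1 + doc_freq[term]))
        result.append((freq / max_freq * idf, term))
    # Highest score first, keep only the top_k entries.
    return sorted(result, reverse=True)[:top_k]


# Made-up counts for a two-document collection.
scores = top_terms({"hippo": 2, "hill": 1}, {"hippo": 1, "hill": 2}, num_docs=2)
for tfidf, term in scores:
    print(term + "\t" + str(tfidf))
```

Here "hill" occurs in both documents, so its score is 0, while "hippo" occurs in only one and gets a positive score.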