I am trying to use tf-idf for feature selection. Here is how I compute tf-idf:
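    import math

    global_term_freq = {}     # term -> number of documents containing it
    global_terms_in_doc = {}  # doc -> {term: count within that doc}
    num_docs = 0

    for item in items:
        # Assumption: each item is a (doc id, list of tokens) pair.
        doc, doc_words = item
        num_docs += 1
        # increment local count (fresh dict for every document)
        terms_in_doc = {}
        for word in doc_words:
            terms_in_doc[word] = terms_in_doc.get(word, 0) + 1
        # increment global frequency (once per document containing the word)
        for word in terms_in_doc:
            global_term_freq[word] = global_term_freq.get(word, 0) + 1
        global_terms_in_doc[doc] = terms_in_doc

    # The scoring below runs after all documents are counted, so the idf is
    # based on the whole corpus; doc is the id of the document being scored.
    result = []
    # iterate over terms in doc, calculate their tf-idf, put in new list
    max_freq = max(global_terms_in_doc[doc].values(), default=0)
    for term, freq in global_terms_in_doc[doc].items():
        # The idf (smoothed by adding 1 to numerator and denominator)
        idf = math.log(float(1 + num_docs) / float(1 + global_term_freq[term]))
        # The tf-idf (tf is normalised by the most frequent term in the doc)
        tfidf = float(freq) / float(max_freq) * idf
        result.append([tfidf, term])
    # sort result on tfidf, highest first
    result = sorted(result, reverse=True)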
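My idea is to take the top k feature_words ranked by tf-idf and use them on the training data to train a classifier. Is this approach sound? And is there any technique for determining the best value of k?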
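
To make the idea concrete, here is a minimal sketch of what I mean by "top k feature_words". It uses the same tf-idf weighting as the code above, but the helper names (`tfidf_per_doc`, `top_k_features`, `to_vector`), the toy documents, and the choice to rank each term by its highest tf-idf in any document are assumptions made for illustration, not an established method. The value of k is left as a plain parameter, since choosing it is exactly what I am unsure about.

```python
import math
from collections import Counter


def tfidf_per_doc(docs):
    """Return {doc_id: {term: tfidf}} with max-normalised tf and smoothed idf."""
    term_counts = {doc_id: Counter(words) for doc_id, words in docs.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    num_docs = len(docs)
    scores = {}
    for doc_id, counts in term_counts.items():
        max_freq = max(counts.values(), default=0)
        scores[doc_id] = {
            term: (freq / max_freq) * math.log((1 + num_docs) / (1 + doc_freq[term]))
            for term, freq in counts.items()
        }
    return scores


def top_k_features(scores, k):
    """Assumed ranking rule: score a term by its highest tf-idf in any document."""
    best = {}
    for doc_scores in scores.values():
        for term, value in doc_scores.items():
            best[term] = max(best.get(term, 0.0), value)
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(best, key=lambda term: (-best[term], term))
    return ranked[:k]


def to_vector(doc_scores, feature_words):
    """Represent one document as its tf-idf values over the selected terms."""
    return [doc_scores.get(term, 0.0) for term in feature_words]


# Hypothetical toy corpus: doc id -> list of tokens.
docs = {
    "d1": ["apple", "apple", "fruit"],
    "d2": ["car", "engine", "fruit"],
    "d3": ["car", "car", "road"],
}
scores = tfidf_per_doc(docs)
feature_words = top_k_features(scores, k=3)
vectors = {doc_id: to_vector(s, feature_words) for doc_id, s in scores.items()}
```

The resulting `vectors` would be what I feed to the classifier, one fixed-length row per training document.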