Question

I am trying to use tf-idf for feature selection. Here is how I compute tf-idf:

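import math

# Assumption: items maps each document id to its list of tokens
# (the snippet does not show how items, doc and doc_words are built).
num_docs = len(items)
global_term_freq = {}      # term -> number of documents containing it
global_terms_in_doc = {}   # doc  -> {term: count within that doc}

for doc, doc_words in items.items():
    # increment local count (fresh dict for every document)
    terms_in_doc = {}
    for word in doc_words:
        terms_in_doc[word] = terms_in_doc.get(word, 0) + 1

    # increment global frequency, once per document containing the word
    for word in terms_in_doc:
        global_term_freq[word] = global_term_freq.get(word, 0) + 1

    global_terms_in_doc[doc] = terms_in_doc

# The idf needs the document frequencies of the whole collection, so the
# scores are computed in a second pass after all counting is done.
results = {}
for doc, terms_in_doc in global_terms_in_doc.items():
    result = []
    # iterate over terms in doc, calculate their tf-idf, put in new list
    max_freq = max(terms_in_doc.values(), default=0)
    for term, freq in terms_in_doc.items():
        # The idf
        idf = math.log(float(1 + num_docs) / float(1 + global_term_freq[term]))
        # The tf-idf
        tfidf = float(freq) / float(max_freq) * idf
        result.append([tfidf, term])

    # sort result on tfidf
    results[doc] = sorted(result, reverse=True)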

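My idea is to take the top k feature_words ranked by tf-idf and use them in the training data to train a classifier. Is this concept sound? Is there any technique for choosing the value of k?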
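To make the top-k idea concrete, here is a minimal sketch of one way it could look. It assumes each document already has a list of `[tfidf, term]` pairs sorted in descending order, like `result` in the snippet. The function name `top_k_feature_words`, the `ranked_by_doc` dictionary, and the sample scores are hypothetical and only there for illustration.

```python
def top_k_feature_words(ranked_by_doc, k):
    """Collect the k highest-scoring terms of every document into one vocabulary.

    ranked_by_doc: dict mapping a document id to a list of [tfidf, term]
    pairs sorted by descending tf-idf (assumed input shape).
    """
    feature_words = set()
    for ranked in ranked_by_doc.values():
        # keep only the first k entries, i.e. the k largest tf-idf scores
        for tfidf, term in ranked[:k]:
            feature_words.add(term)
    # sorted so that the feature order is reproducible
    return sorted(feature_words)


# Hypothetical scores, for illustration only.
ranked_by_doc = {
    "doc1": [[0.9, "kernel"], [0.4, "linux"], [0.1, "the"]],
    "doc2": [[0.8, "recipe"], [0.5, "butter"], [0.1, "the"]],
}
feature_words = top_k_feature_words(ranked_by_doc, k=2)
print(feature_words)  # ['butter', 'kernel', 'linux', 'recipe']
```

Here k is simply passed in as a parameter; the sketch does not say how to pick it, which is the open part of the question.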

tf-idf weighting selection top-k

0 answers: