
Time: 2014-07-10 23:53:46

Tags: python nlp gensim


4 answers:

Answer 0 (score: 5)


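from gensim import corpora
from gensim.models import TfidfModel

# documents: a list of tokenized documents (lists of strings)
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

tfidf = TfidfModel(corpus, id2word=dictionary)

# Collect the ids of words whose tf-idf score is below the threshold in any document
low_value = 0.2
low_value_words = []
for bow in corpus:
    low_value_words += [id for id, value in tfidf[bow] if value < low_value]

# Remove those ids from the dictionary; this step is implied by rebuilding the corpus below
dictionary.filter_tokens(bad_ids=low_value_words)

# Rebuild the corpus with the filtered dictionary
new_corpus = [dictionary.doc2bow(doc) for doc in documents]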
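
To see what this dictionary-level approach does, here is a minimal sketch that uses plain Python lists in place of gensim objects. The helper names collect_low_value_ids and drop_ids are hypothetical, and the weights are made-up illustration data, not real TfidfModel output. It suggests that an id scoring below the threshold in any one document ends up removed from every document, even where its score is high.

```python
def collect_low_value_ids(tfidf_corpus, low_value):
    """Return the ids whose weight is below low_value in at least one document."""
    bad_ids = set()
    for tfidf_doc in tfidf_corpus:
        bad_ids.update(term_id for term_id, weight in tfidf_doc if weight < low_value)
    return bad_ids


def drop_ids(bow_corpus, bad_ids):
    """Remove every (id, count) pair whose id is in bad_ids, across all documents."""
    return [[(term_id, count) for term_id, count in bow if term_id not in bad_ids]
            for bow in bow_corpus]


# Made-up data: (id, count) pairs and matching (id, weight) pairs.
bow_corpus = [[(0, 1), (1, 2)], [(0, 3), (2, 1)]]
tfidf_corpus = [[(0, 0.1), (1, 0.9)], [(0, 0.7), (2, 0.7)]]

bad_ids = collect_low_value_ids(tfidf_corpus, low_value=0.2)
filtered = drop_ids(bow_corpus, bad_ids)
print(bad_ids)    # {0}
print(filtered)   # [[(1, 2)], [(2, 1)]]
```

Id 0 scores 0.7 in the second document but is still dropped there, because the removal is global. If that is not what you want, filter each document separately instead.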

Answer 1 (score: 2)


from gensim import corpora, models

# Same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# Filter low-value words document by document
low_value = 0.025

for i in range(len(corpus)):
    bow = corpus[i]
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    new_bow = [b for b in bow if b[0] not in low_value_words]

    # Replace the document with its filtered bag-of-words
    corpus[i] = new_bow

Answer 2 (score: 0)


from gensim import corpora, models

# Same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# Filter low-value words and also words missing from the tf-idf model
low_value = 0.025

for i in range(len(corpus)):
    bow = corpus[i]
    tfidf_ids = [id for id, value in tfidf[bow]]
    bow_ids = [id for id, value in bow]
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    # Words with a tf-idf score of 0 are missing from the tf-idf output
    words_missing_in_tfidf = [id for id in bow_ids if id not in tfidf_ids]

    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]

    # Must stay inside the loop so that every document is updated
    corpus[i] = new_bow
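
Why some ids go missing from the tf-idf output can be illustrated with a small pure-Python sketch. It assumes the weighting idf = log2(N / df), which I believe matches gensim's default, and it drops near-zero weights the way TfidfModel appears to. Normalization is left out for brevity. The function name toy_tfidf and the sample corpus are made up for illustration.

```python
import math


def toy_tfidf(bow, doc_freq, num_docs, eps=1e-12):
    """Weight each (id, count) pair by count * log2(num_docs / df), dropping near-zero weights."""
    weighted = []
    for term_id, count in bow:
        idf = math.log2(num_docs / doc_freq[term_id])
        weight = count * idf
        if abs(weight) > eps:
            weighted.append((term_id, weight))
    return weighted


# Made-up corpus of (id, count) pairs; id 0 occurs in every document.
corpus = [[(0, 1), (1, 1)], [(0, 2), (2, 1)]]

# Document frequency: in how many documents each id occurs.
doc_freq = {}
for bow in corpus:
    for term_id, _ in bow:
        doc_freq[term_id] = doc_freq.get(term_id, 0) + 1

tfidf_first = toy_tfidf(corpus[0], doc_freq, num_docs=len(corpus))
print(tfidf_first)  # [(1, 1.0)]
```

Id 0 appears in both documents, so its idf is log2(2 / 2) = 0 and it never shows up in the weighted output. A filter that only checks value < low_value would therefore leave it in the bag-of-words.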

Answer 3 (score: 0)

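Suppose you have a document tfidf_doc generated by gensim's TfidfModel(), together with its corresponding bag-of-words document bow_doc, and you want to drop the words whose tfidf values fall in the lowest cut_percent share of this document. You can call tfidf_filter(tfidf_doc, cut_percent), which returns a trimmed version of tfidf_doc (the code treats cut_percent as a fraction between 0 and 1 and filters the list in place):

def tfidf_filter(tfidf_doc, cut_percent):
    sorted_by_tfidf = sorted(tfidf_doc, key=lambda tup: tup[1])
    cut_value = sorted_by_tfidf[int(len(sorted_by_tfidf)*cut_percent)][1]
    #print('before cut:',len(tfidf_doc))
    #print('cut value:', cut_value)
    # Iterate backwards so that pop() does not shift the indices still to be visited
    for i in range(len(tfidf_doc)-1, -1, -1):
        if tfidf_doc[i][1] < cut_value:
            tfidf_doc.pop(i)
    #print('after cut:',len(tfidf_doc))
    return tfidf_doc

Then, to filter the document bow_doc by the resulting tfidf_doc, just call filter_bow_by_tfidf(bow_doc, tfidf_doc), which returns a cut version of bow_doc.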
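
The body of filter_bow_by_tfidf is not shown here, so the following is only a guess at what such a function might look like: it keeps the (id, count) pairs of bow_doc whose id still appears in the filtered tfidf_doc. The signature follows the call above; everything else, including the sample data, is an assumption.

```python
def filter_bow_by_tfidf(bow_doc, tfidf_doc):
    """Keep only the (id, count) pairs whose id survives in tfidf_doc."""
    kept_ids = {term_id for term_id, _ in tfidf_doc}
    return [(term_id, count) for term_id, count in bow_doc if term_id in kept_ids]


bow_doc = [(0, 2), (1, 1), (3, 4)]
tfidf_doc = [(1, 0.6), (3, 0.8)]  # made-up weights left after filtering
print(filter_bow_by_tfidf(bow_doc, tfidf_doc))  # [(1, 1), (3, 4)]
```

Because it keeps only ids present in tfidf_doc, this version would also drop words that the tf-idf model omitted entirely, not just the ones cut by tfidf_filter.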
