How do I filter out low tf-idf words from a corpus with gensim?

Asked: 2014-07-10 23:53:46

Tags: python nlp gensim

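I am using gensim for some NLP tasks. I have created a corpus from dictionary.doc2bow, where dictionary is a corpora.Dictionary object. Now I want to filter out the terms with low tf-idf values before running an LDA model. I looked through the documentation of the corpus class but could not find a way to access the terms. Any ideas? Thanks.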

4 answers:

Answer 0 (score: 5)

Say your corpus is built as follows:

corpus = [dictionary.doc2bow(doc) for doc in documents]

After running TF-IDF, you can collect a list of the low-value words:

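from gensim.models import TfidfModel

tfidf = TfidfModel(corpus, id2word=dictionary)

low_value = 0.2
low_value_words = []
for bow in corpus:
    # collect the ids of terms whose tf-idf weight in this document is below the threshold
    low_value_words += [term_id for term_id, value in tfidf[bow] if value < low_value]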

Then filter them out of the dictionary before running LDA:

dictionary.filter_tokens(bad_ids=low_value_words)

Now recompute the corpus, with the low-value words filtered out:

new_corpus = [dictionary.doc2bow(doc) for doc in documents]

Answer 1 (score: 2)

This is old, but if you want to filter at the per-document level, do something like this:

from gensim import corpora, models

# same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# filter low value words
low_value = 0.025

for i in range(len(corpus)):
    bow = corpus[i]
    # ids of the terms whose tf-idf weight in this document is below the threshold
    low_value_words = {term_id for term_id, value in tfidf[bow] if value < low_value}
    new_bow = [b for b in bow if b[0] not in low_value_words]

    # reassign
    corpus[i] = new_bow

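Answer 2 (score: 0)

This is essentially the same as the previous answer, but it additionally handles words that are missing from the tf-idf representation because their score is 0 (terms that occur in every document). The previous answers do not filter out these terms, so they still appear in the final corpus.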

from gensim import corpora, models

# Same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)

# Filter low value words and also words missing from the tf-idf model.
low_value = 0.025

for i in range(len(corpus)):
    bow = corpus[i]
    tfidf_bow = tfidf[bow]
    tfidf_ids = {term_id for term_id, value in tfidf_bow}
    low_value_words = {term_id for term_id, value in tfidf_bow if value < low_value}
    # Words with a tf-idf score of 0 are missing from tfidf[bow]
    words_missing_in_tfidf = {term_id for term_id, count in bow if term_id not in tfidf_ids}

    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]

    # reassign (inside the loop, so that every document is updated)
    corpus[i] = new_bow
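To see why a term that occurs in every document goes missing from the tf-idf vector, here is a toy pure-Python sketch of the weighting. It assumes raw term counts, idf = log2(N / df), L2 normalisation, and dropping of near-zero weights, which is my understanding of gensim's defaults rather than something stated in this thread. The function tfidf_vectors and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, eps=1e-12):
    """Toy tf-idf: count * log2(N / df), L2-normalised per document.

    Entries whose weight is (near) zero are dropped, which is the
    behaviour the answer above works around.
    """
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        weights = {t: c * math.log2(n_docs / df[t]) for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append(
            {t: w / norm for t, w in weights.items() if norm > 0 and abs(w / norm) > eps}
        )
    return vectors


docs = [["cat", "dog", "fish"], ["cat", "dog", "bird"], ["cat", "tree", "tree"]]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    # "cat" is in every document, so its idf is log2(3 / 3) = 0 and it is dropped
    print(sorted(vec), "missing:", sorted(set(doc) - set(vec)))
```

Because "cat" never shows up in any tf-idf vector, a filter of the form value < low_value never sees it, which is why the bag-of-words ids have to be compared against the tf-idf ids explicitly.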

Answer 3 (score: 0)

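Say you have a document tfidf_doc generated by gensim's TfidfModel(), with the corresponding bag-of-words document bow_doc, and you want to filter out the words whose tf-idf values fall in the lowest cut_percent of this document. You can call tfidf_filter(tfidf_doc, cut_percent), and it will return a cut version of tfidf_doc:

def tfidf_filter(tfidf_doc, cut_percent):
    # cut_percent is used as a fraction in [0, 1), e.g. 0.2 for the lowest 20%
    sorted_by_tfidf = sorted(tfidf_doc, key=lambda tup: tup[1])
    cut_value = sorted_by_tfidf[int(len(sorted_by_tfidf)*cut_percent)][1]
    #print('before cut:',len(tfidf_doc))
    #print('cut value:', cut_value)
    # iterate backwards so that pop() does not shift the indices still to be visited
    for i in range(len(tfidf_doc)-1, -1, -1):
        if tfidf_doc[i][1] < cut_value:
            tfidf_doc.pop(i)
    #print('after cut:',len(tfidf_doc))
    return tfidf_doc

Then, to filter the document bow_doc by the resulting tfidf_doc, just call filter_bow_by_tfidf(bow_doc, tfidf_doc), and it will return a cut version of bow_doc.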
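The definition of filter_bow_by_tfidf does not appear above. A minimal sketch of what such a function might look like, assuming it simply keeps the bag-of-words entries whose token ids survived in the cut tfidf_doc; this is a guess at the intent, not the answer author's code, and the sample values are made up:

```python
def filter_bow_by_tfidf(bow_doc, tfidf_doc):
    """Keep only the (token_id, count) pairs whose token_id is still in tfidf_doc.

    Hypothetical reconstruction: the original definition is not shown.
    """
    kept_ids = {token_id for token_id, _ in tfidf_doc}
    return [(token_id, count) for token_id, count in bow_doc if token_id in kept_ids]


bow_doc = [(0, 1), (1, 2), (2, 1)]
tfidf_doc = [(1, 0.89), (2, 0.45)]  # made-up weights; id 0 has already been cut
filtered_bow = filter_bow_by_tfidf(bow_doc, tfidf_doc)
print(filtered_bow)
```

Using a set for kept_ids keeps the membership test cheap even for long documents.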