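I am using gensim for some NLP tasks. I have created a corpus from dictionary.doc2bow, where dictionary is a corpora.Dictionary object. Now I want to filter out terms with low tf-idf values before running an LDA model. I looked through the documentation of the corpus classes but could not find a way to access these terms. Any ideas? Thanks.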
Answer 0 (score: 5)
Say your corpus is as follows:
corpus = [dictionary.doc2bow(doc) for doc in documents]
After running TF-IDF, you can collect the list of low-value words:
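from gensim.models import TfidfModel

tfidf = TfidfModel(corpus, id2word=dictionary)
low_value = 0.2
low_value_words = []
for bow in corpus:
    # collect the ids of terms whose tf-idf weight in this document is below the threshold
    low_value_words += [id for id, value in tfidf[bow] if value < low_value]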
Then filter them out of the dictionary before running LDA:
dictionary.filter_tokens(bad_ids=low_value_words)
Now recompute the corpus with the low-value words filtered out:
new_corpus = [dictionary.doc2bow(doc) for doc in documents]
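One thing to keep in mind with the loop above: an id lands in low_value_words as soon as it scores below the threshold in any single document, and the same id can be appended many times. Here is a minimal sketch of that collection step using a set, with hand-written (term_id, weight) pairs standing in for real tfidf[bow] output. The helper name collect_low_value_ids is made up for illustration and is not part of gensim.

```python
def collect_low_value_ids(tfidf_corpus, low_value):
    """Return the ids that score below low_value in at least one document."""
    low_ids = set()
    for tfidf_doc in tfidf_corpus:
        low_ids.update(term_id for term_id, weight in tfidf_doc if weight < low_value)
    return low_ids


# Hand-written stand-ins for tfidf[bow]: one list of (term_id, weight) per document.
tfidf_corpus = [
    [(0, 0.9), (1, 0.1)],   # term 1 is weak in this document ...
    [(1, 0.8), (2, 0.15)],  # ... and strong in this one, yet it is still collected
]
low_ids = collect_low_value_ids(tfidf_corpus, low_value=0.2)
print(sorted(low_ids))  # [1, 2]
```

As far as I know, filter_tokens accepts any iterable of ids for bad_ids, so the set should be usable there in the same way as the list.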
Answer 1 (score: 2)
This is old, but if you want to do the filtering at the per-document level, do something like this:
from gensim import corpora, models

# same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)
# filter low value words
low_value = 0.025
for i in range(0, len(corpus)):
    bow = corpus[i]
    low_value_words = []  # reinitialize to be safe. You can skip this.
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    new_bow = [b for b in bow if b[0] not in low_value_words]
    # reassign
    corpus[i] = new_bow
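If you would rather not overwrite corpus in place, the same per-document rule can be wrapped in a function that returns a new corpus. This is only a sketch: filter_corpus is a made-up name, and the hand-written tfidf_corpus stands in for what [tfidf[bow] for bow in corpus] would give you with a real model. A set is used for the membership test so that long documents do not get slow.

```python
def filter_corpus(corpus, tfidf_corpus, low_value):
    """Return a new corpus without the terms that score below low_value per document."""
    new_corpus = []
    for bow, tfidf_doc in zip(corpus, tfidf_corpus):
        low_ids = {term_id for term_id, weight in tfidf_doc if weight < low_value}
        new_corpus.append([(term_id, count) for term_id, count in bow if term_id not in low_ids])
    return new_corpus


# Hand-written stand-ins: two documents as (term_id, count) and (term_id, weight) pairs.
corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]
tfidf_corpus = [[(0, 0.99), (1, 0.01)], [(1, 0.6), (2, 0.8)]]
print(filter_corpus(corpus, tfidf_corpus, low_value=0.025))
# [[(0, 2)], [(1, 3), (2, 1)]]
```

Term 1 is dropped from the first document only, and the original corpus list is left untouched.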
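Answer 2 (score: 0)
This is essentially the same as the previous answer, but it additionally handles words that are missing from the tf-idf representation because they score 0 (terms that occur in every document). The previous answers do not filter out these terms, so they still appear in the final corpus.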
from gensim import corpora, models

# Same as before
dictionary = corpora.Dictionary(doc_list)
corpus = [dictionary.doc2bow(doc) for doc in doc_list]
tfidf = models.TfidfModel(corpus, id2word=dictionary)
# Filter low value words and also words missing in tfidf models.
low_value = 0.025
for i in range(0, len(corpus)):
    bow = corpus[i]
    low_value_words = []  # reinitialize to be safe. You can skip this.
    tfidf_ids = [id for id, value in tfidf[bow]]
    bow_ids = [id for id, value in bow]
    low_value_words = [id for id, value in tfidf[bow] if value < low_value]
    words_missing_in_tfidf = [id for id in bow_ids if id not in tfidf_ids]  # The words with tf-idf score 0 will be missing
    new_bow = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]
    # reassign
    corpus[i] = new_bow
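To see why a term that occurs in every document ends up with a score of 0, here is a small hand calculation. It assumes the idf formula log2(N / df), which I believe is gensim's default but have not re-checked against the current source, so treat that as an assumption. The toy documents and the idf helper are made up for illustration.

```python
import math


def idf(doc_freq, total_docs, log_base=2.0):
    """Inverse document frequency as log(N / df) in the given base (assumed default)."""
    return math.log(total_docs / doc_freq, log_base)


docs = [["cat", "sat", "the"], ["dog", "ran", "the"], ["cat", "ran", "the"]]
total_docs = len(docs)

# Count in how many documents each term occurs.
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

weights = {term: idf(df, total_docs) for term, df in doc_freq.items()}
print(weights["the"])  # 0.0, because log(3 / 3) = log(1) = 0
```

Whatever the term frequency is, multiplying it by an idf of 0 gives 0, which is why such terms do not show up in tfidf[bow] and need the extra words_missing_in_tfidf check.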
Answer 3 (score: 0)
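Suppose you have a document tfidf_doc, generated by gensim's TfidfModel(), with the corresponding bag-of-words document bow_doc, and you want to filter out the words whose tfidf values fall in the lowest cut_percent of this document. Note that the code uses cut_percent as a fraction from 0 up to, but not including, 1. You can call tfidf_filter(tfidf_doc, cut_percent), and it will return a reduced version of tfidf_doc:
def tfidf_filter(tfidf_doc, cut_percent):
    if not tfidf_doc:  # nothing to cut in an empty document
        return tfidf_doc
    sorted_by_tfidf = sorted(tfidf_doc, key=lambda tup: tup[1])
    cut_value = sorted_by_tfidf[int(len(sorted_by_tfidf)*cut_percent)][1]
    #print('before cut:',len(tfidf_doc))
    #print('cut value:', cut_value)
    # walk backwards so that pop() does not shift the indices still to be visited
    for i in range(len(tfidf_doc)-1, -1, -1):
        if tfidf_doc[i][1] < cut_value:
            tfidf_doc.pop(i)
    #print('after cut:',len(tfidf_doc))
    return tfidf_doc
Then, to filter the document bow_doc by the resulting tfidf_doc, just call filter_bow_by_tfidf(bow_doc, tfidf_doc), and it will return a cut-down version of bow_doc.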
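The body of filter_bow_by_tfidf is not shown in this answer. What follows is a guess at what it might look like, not the author's code: it keeps the bag-of-words entries whose ids are still present in the filtered tfidf_doc. The sample vectors are made up.

```python
def filter_bow_by_tfidf(bow_doc, tfidf_doc):
    """Hypothetical reconstruction: keep bow entries whose id survived in tfidf_doc."""
    kept_ids = {term_id for term_id, _ in tfidf_doc}
    return [(term_id, count) for term_id, count in bow_doc if term_id in kept_ids]


# Hand-written stand-ins: (term_id, count) pairs and the (term_id, weight) pairs
# that a tfidf filtering step might have left behind.
bow_doc = [(0, 2), (3, 1), (7, 4)]
tfidf_doc = [(0, 0.4), (7, 0.9)]
print(filter_bow_by_tfidf(bow_doc, tfidf_doc))  # [(0, 2), (7, 4)]
```

Unlike tfidf_filter, this version returns a new list instead of modifying bow_doc in place, and it also drops terms that never appeared in the tf-idf vector at all.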