How do I find and remove words with low and high IDF values?

Asked: 2019-05-15 19:17:16

Tags: python tfidfvectorizer

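I am trying to filter words based on their IDF values. I have a list of 36k words and a list of 24k IDF values. How can I map each word to its IDF value so that filtering becomes easy?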

I have stored all the unique words (36k) from a DataFrame, and I have 24k IDF values.

a=list(project_data['final_input_text'].str.split(' ', expand=True).stack().unique())

I want the words mapped to their IDF values in a dictionary or a DataFrame.

1 answer:

Answer 0: (score: 1)

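import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# fit the vectorizer on the text column; idf_ holds one IDF value per vocabulary word
vectorizer = TfidfVectorizer()
train_tf = vectorizer.fit(train['final_input_text'].values)
idf_scores = train_tf.idf_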

Filter the indices using a lower threshold of 8 and an upper threshold of 11.55:

# positions of the vocabulary words whose IDF lies strictly between the two thresholds
filtered_indices = np.argwhere((idf_scores > 8) & (idf_scores < 11.55))
filtered_indices = [idx[0] for idx in filtered_indices]

# list of vocabulary from the vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead)
vocabulary = train_tf.get_feature_names_out()

#preparing a set with filtered vocabulary
filtered_voc = {vocabulary[i] for i in filtered_indices}

Remove from each text the words that are not in the filtered vocabulary:

filtered_text_list = []
for text in train['final_input_text'].values:
  text_word_list = [word for word in text.split() if word in filtered_voc]
  filtered_text_list.append(' '.join(text_word_list))

Now filtered_text_list no longer contains words with a low IDF value (8 or below) or a high IDF value (11.55 or above).