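I am trying to filter words based on their IDF values. My list has 36k unique words, but I only have IDF values for 24k of them. How can I map each word to its IDF value so that filtering is easy?
I have stored all the unique words (36k) from a data frame, and I have 24k IDF values.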
a=list(project_data['final_input_text'].str.split(' ', expand=True).stack().unique())
I want each word mapped to its IDF value in a dictionary or a data frame.
Answer 0 (score: 1)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
train_tf = vectorizer.fit(train['final_input_text'].values)
idf_scores = train_tf.idf_
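For context on what `idf_` holds: with the default settings (`smooth_idf=True`), scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. Below is a minimal pure-Python sketch of that formula. The three-document corpus is made up for illustration, and plain whitespace splitting is an assumption; the real vectorizer lowercases text and tokenizes with a regular expression.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data, only for illustration)
docs = [
    "students need books",
    "students need laptops",
    "students love robotics kits",
]

n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter()
for doc in docs:
    df.update(set(doc.split()))  # set() so a word counts once per document

# Smoothed IDF, matching scikit-learn's default formula
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}
```

Words that occur in every document get the minimum value of 1.0, and rarer words get larger values, which is why the thresholds below cut off both very common and very rare words.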
Filter the indices using a lower IDF threshold of 8 and an upper threshold of about 11 (11.55 in the code below)
filtered_indices = np.argwhere((idf_scores > 8) & (idf_scores < 11.55))
filtered_indices = [idx[0] for idx in filtered_indices]
# list of vocabulary from the vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
vocabulary = train_tf.get_feature_names_out()
#preparing a set with filtered vocabulary
filtered_voc = {vocabulary[i] for i in filtered_indices}
Remove from each text the words that are not in the filtered vocabulary
filtered_text_list = []
for text in train['final_input_text'].values:
    text_word_list = [word for word in text.split() if word in filtered_voc]
    filtered_text_list.append(' '.join(text_word_list))
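Now filtered_text_list no longer contains words whose IDF value is too low (8 or below) or too high (11.55 or above, the cutoff used in the code)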