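I generate a word cloud from the tf-idf scores of a dataset, but before building the word cloud I filter the dataset against different dictionaries stored as CSV files (for example medicine, pharmacy, pathology).
I am able to filter the dataset with each dictionary individually, but I would like a way to run all of the dictionaries without duplicating the code.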
import jellyfish
from wordcloud import WordCloud

# keywords is a dict mapping each term to its tf-idf score (built earlier)
with open("Med.csv") as f:
    dictFile = f.read().splitlines()  # read dictionary, one word per line
max_dist = 2
new_keywords = []  # (term, boosted score) pairs
for key in keywords:  # read medical data
    terms = key
    tf_idf = keywords[key]
    print(terms, tf_idf)
    min_dissim = max_dist  # threshold
    for words in dictFile:  # iterate through dictionary
        cmpFile = jellyfish.levenshtein_distance(terms, words)
        # filter with dictionary
        if cmpFile < min_dissim:  # terms within threshold
            print(cmpFile, terms, words)
            min_dissim = cmpFile  # assigned to minimum distance i.e. < 2
            print(min_dissim)
            if min_dissim == 0:  # exact match, no need to look further
                break
    # the closer the best match, the larger the boost (0 when nothing is within max_dist)
    new_tf_idf = tf_idf + ((max_dist - min_dissim) / max_dist)
    new_keywords.append((terms, new_tf_idf))
d = dict(new_keywords)  # wordcloud
wordcloud = WordCloud(width=800, height=800,
                      background_color='white', normalize_plurals=False,
                      max_words=50, max_font_size=100).generate_from_frequencies(d)
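This code produces a word cloud of the terms that are similar to words in my medical dictionary, but what I want is the terms that are similar to words across all of the dictionaries.
Thanks in advance.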
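
A minimal sketch of one way the loop above might be generalised, not a definitive answer. It assumes that "similar across all dictionaries" means taking the closest match found in any of them, and that each dictionary file holds one word per line. The names `load_dictionary`, `closest_distance`, and `boost_scores` are hypothetical helpers, and the pure-Python `levenshtein` function only stands in for `jellyfish.levenshtein_distance` so that the example runs without extra packages.

```python
from pathlib import Path


def levenshtein(a, b):
    """Pure-Python edit distance; stand-in for jellyfish.levenshtein_distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def load_dictionary(path):
    """Read one word per line, skipping blank lines (hypothetical helper)."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def closest_distance(term, words, max_dist):
    """Smallest edit distance from term to any word, capped at max_dist."""
    best = max_dist
    for word in words:
        best = min(best, levenshtein(term, word))
        if best == 0:  # exact match, stop early as in the original loop
            break
    return best


def boost_scores(keywords, dictionaries, max_dist=2):
    """Apply the original boost formula using the best match in any dictionary."""
    boosted = {}
    for term, tf_idf in keywords.items():
        best = min(
            (closest_distance(term, words, max_dist)
             for words in dictionaries.values()),
            default=max_dist,
        )
        boosted[term] = tf_idf + (max_dist - best) / max_dist
    return boosted


# Small in-memory demo; the words and scores are made up for illustration.
dictionaries = {
    "medicine": ["aspirin", "insulin"],
    "pathology": ["biopsy", "lesion"],
}
keywords = {"aspirin": 0.5, "biopsi": 0.25, "table": 0.125}
boosted = boost_scores(keywords, dictionaries)
print(boosted)
```

With real files, the `dictionaries` mapping could be built with something like `{p.stem: load_dictionary(p) for p in Path("dicts").glob("*.csv")}`, where the `dicts` folder is an assumed location, and the resulting `boosted` dict can be passed to `generate_from_frequencies` as before. Keeping the dictionaries in a mapping also leaves room to build one word cloud per dictionary by calling `boost_scores` with a single-entry mapping.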