Question

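I generated a word cloud from the tf-idf scores of a dataset, but before building the word cloud I filtered the dataset with several dictionaries stored as CSV files (e.g. medicine, pharmacy, pathology).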

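I am able to filter the dataset with each dictionary individually, but I would like a way to run all of the dictionaries without repeating the code.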

import jellyfish
from wordcloud import WordCloud

dictFile = open("Med.csv").read().splitlines()  # read dictionary
max_dist = 2
new_keywords = []  # re-weighted (term, score) pairs
for key in keywords:  # read medical data (term -> tf-idf score)
    terms = key
    tf_idf = keywords[key]
    print(terms, tf_idf)
    min_dissim = max_dist  # threshold
    for words in dictFile:  # iterate through dictionary
        cmpFile = jellyfish.levenshtein_distance(terms, words)

        # filter with dictionary
        if cmpFile < min_dissim:  # term within threshold
            print(cmpFile, terms, words)
            min_dissim = cmpFile  # assigned to minimum distance, i.e. < 2
            print(min_dissim)

        if min_dissim == 0:  # exact match, no need to look further
            break
    new_tf_idf = tf_idf + ((max_dist - min_dissim) / max_dist)
    new_keywords.append((terms, new_tf_idf))

d = dict(new_keywords)  # frequencies for the word cloud
wordcloud = WordCloud(width=800, height=800,
                      background_color='white', normalize_plurals=False,
                      max_words=50, max_font_size=100).generate_from_frequencies(d)

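This code produces a word cloud of the terms that are similar to words in my medical dictionary, but what I want is the terms that are similar to words in any of the dictionaries.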
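
To make the goal concrete, here is a minimal sketch of one possible shape for this, not a tested solution for the real data. It assumes the dictionaries can simply be merged into one set of words and that a term should be boosted by its closest match in any of them. The helper names `load_words` and `boost_scores` are made up for illustration, as are any file names other than `Med.csv`. The pure-Python edit distance is only a fallback for when `jellyfish` is not installed.

```python
from pathlib import Path

try:
    # Same function the question already uses.
    from jellyfish import levenshtein_distance
except ImportError:
    def levenshtein_distance(a, b):
        """Fallback: classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]


def load_words(paths):
    """Merge the non-empty lines of several dictionary files into one set."""
    words = set()
    for path in paths:
        lines = Path(path).read_text().splitlines()
        words.update(line.strip() for line in lines if line.strip())
    return words


def boost_scores(keywords, words, max_dist=2):
    """Return {term: tf-idf + bonus}, where the bonus depends on the
    closest dictionary word (same formula as in the question)."""
    boosted = {}
    for term, tf_idf in keywords.items():
        min_dissim = max_dist  # distances >= max_dist give no bonus
        for word in words:
            min_dissim = min(min_dissim, levenshtein_distance(term, word))
            if min_dissim == 0:  # exact match, stop early
                break
        boosted[term] = tf_idf + (max_dist - min_dissim) / max_dist
    return boosted


# Small in-memory demo; with real files it would be something like
# words = load_words(["Med.csv", "Pharm.csv"])  # "Pharm.csv" is hypothetical
demo = boost_scores({"aspirin": 0.5, "tumour": 0.3, "banana": 0.2},
                    {"aspirin", "tumor"})
print(demo)
```

The resulting dict could then be passed to `generate_from_frequencies` in place of `d`. Merging everything into one set loses track of which dictionary matched; if that matters, the same function could instead be called once per dictionary inside a loop.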

Thanks in advance.

Filtering a dataset using different dictionaries

0 answers: