Filtering a dataset using several different dictionaries

Asked: 2019-04-10 10:05:51

Tags: python levenshtein-distance word-cloud tfidfvectorizer

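I generate a word cloud from the TF-IDF scores of a dataset, but before building the word cloud I filter the dataset against several dictionaries stored as CSV files (e.g. medicine, pharmacy, pathology).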

I can filter the dataset with each dictionary individually, but I would like a way to run all of the dictionaries without duplicating the code.

import jellyfish
from wordcloud import WordCloud

# keywords: dict mapping each term to its TF-IDF score (built earlier)
dictFile = open("Med.csv").read().splitlines()  # read dictionary
max_dist = 2
new_keywords = []  # terms with adjusted scores
for key in keywords:  # iterate over the medical data
    terms = key
    tf_idf = keywords[key]
    print(terms, tf_idf)
    min_dissim = max_dist  # threshold
    for words in dictFile:  # iterate through dictionary
        cmpFile = jellyfish.levenshtein_distance(terms, words)

        # filter with dictionary
        if cmpFile < min_dissim:  # term within threshold
            print(cmpFile, terms, words)
            min_dissim = cmpFile  # smallest distance so far, i.e. < 2
            print(min_dissim)

        if min_dissim == 0:  # exact match, no need to look further
            break
    # boost the score: the closer the match, the larger the bonus
    new_tf_idf = tf_idf + ((max_dist - min_dissim) / max_dist)
    new_keywords.append((terms, new_tf_idf))

d = dict(new_keywords)  # frequencies for the word cloud
wordcloud = WordCloud(width=800, height=800,
                      background_color='white', normalize_plurals=False,
                      max_words=50, max_font_size=100).generate_from_frequencies(d)

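This code produces a word cloud of terms that are similar to the words in my medical dictionary, but what I want is terms that are similar to the words in any of the dictionaries.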
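One possible way to avoid repeating the loop per dictionary is to move the matching step into a function and take the smallest distance across all dictionaries. The following is only a sketch under assumptions: the helper names (`load_dictionary`, `min_distance`, `boost_keywords`) are hypothetical, each CSV is assumed to hold one term per row in its first column, and a pure-Python Levenshtein fallback is used when `jellyfish` is not installed.

```python
import csv

try:
    from jellyfish import levenshtein_distance
except ImportError:
    # Fallback (assumption): plain dynamic-programming edit distance.
    def levenshtein_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]


def load_dictionary(path):
    """Read one CSV dictionary; assumes one term per row, first column."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]


def min_distance(term, words, max_dist):
    """Smallest edit distance from term to any word, capped at max_dist."""
    best = max_dist
    for word in words:
        dist = levenshtein_distance(term, word)
        if dist < best:
            best = dist
            if best == 0:  # exact match, stop early
                break
    return best


def boost_keywords(keywords, dictionaries, max_dist=2):
    """Boost each TF-IDF score by its closest match in ANY dictionary.

    keywords: dict of term -> tf-idf score
    dictionaries: dict of dictionary name -> list of words
    """
    boosted = {}
    for term, tf_idf in keywords.items():
        best = min((min_distance(term, words, max_dist)
                    for words in dictionaries.values()),
                   default=max_dist)
        boosted[term] = tf_idf + (max_dist - best) / max_dist
    return boosted
```

With files on disk, the dictionaries could be loaded in one comprehension, for example `dictionaries = {name: load_dictionary(name) for name in ["Med.csv", "Pharm.csv", "Path.csv"]}` (only `Med.csv` appears in the question; the other file names are placeholders), and the result of `boost_keywords(keywords, dictionaries)` can be passed directly to `generate_from_frequencies`.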

Thanks in advance.

0 answers:

No answers yet.