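This is a follow-up to the question I asked here, so I decided to post it as a separate question.
Is there a way to add a relevance value next to each matched list name in the column matched_list_names? The relevance formula is (number of matched words from list / total number of words in that list) * 100, so that I can find the most relevant list name. For the first row, the relevance for politics is (1/3)*100, roughly 30%, i.e. 1 word matched out of the 3 words in the politics list. For sports it is likewise (1/3)*100, roughly 30%. The miscellaneous value is 100 - (sum of the other values), i.e. 100 - (30 + 30) = 40. The output would therefore look like this: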
word_list                                  matched_list_names
['nuclear','election','usa','baseball']    politics 30,sports 30,miscellaneous 40
['football','united','thriller']           sports 30,movies 30,miscellaneous 40
['marvels','spiderman','hockey']           movies 60,sports 30
.................... .....................
.................... .....................
.................... ....................
Answer 0 (score: 0)
Use:
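from collections import Counter

movies = ['spiderman', 'marvels', 'thriller']
sports = ['baseball', 'hockey', 'football']
politics = ['election', 'china', 'usa']
# Map each list name to its words
d = {'movies': movies, 'sports': sports, 'politics': politics}
# Invert the mapping: word -> list name
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

def f(x):
    # Count words per list name; unmatched words fall under 'miscellaneous'
    a = Counter([d1.get(y, 'miscellaneous') for y in x])
    # Express each count as a percentage of the words in this row
    return ', '.join(['{} {}'.format(k, v / sum(a.values()) * 100) for k, v in a.items()])

# df is the DataFrame from the question; word_list holds lists of words
df['matched_list_names'] = df['word_list'].apply(f)
print(df)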
word_list \
0 [nuclear, election, usa, baseball]
1 [football, united, thriller]
2 [marvels, hollywood, spiderman]
matched_list_names
0 miscellaneous 25.0, politics 50.0, sports 25.0
1 sports 33.33333333333333, miscellaneous 33.333...
2 movies 66.66666666666666, miscellaneous 33.333...
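Note that f above divides each count by the number of words in the row, whereas the question's formula divides by the number of words in the matched list. A minimal sketch of the question's formula follows, assuming the same three lists as above. The names CATEGORIES, relevance_scores and format_scores are hypothetical and used only for illustration, and rounding to one decimal place is an assumption, not something the question specifies.

```python
# Hypothetical constant: the three word lists from the answer above.
CATEGORIES = {
    'movies': ['spiderman', 'marvels', 'thriller'],
    'sports': ['baseball', 'hockey', 'football'],
    'politics': ['election', 'china', 'usa'],
}

def relevance_scores(words, categories=CATEGORIES, ndigits=1):
    """Score each list as (matched words / words in that list) * 100."""
    unique_words = set(words)
    scores = {}
    for name, members in categories.items():
        matched = len(unique_words.intersection(members))
        if matched:
            scores[name] = round(matched / len(members) * 100, ndigits)
    # Whatever is left of 100 goes to 'miscellaneous'. It is clamped at
    # zero because the per-list scores can add up to 100 or more.
    rest = round(max(0.0, 100 - sum(scores.values())), ndigits)
    if rest > 0:
        scores['miscellaneous'] = rest
    return scores

def format_scores(scores):
    """Join the scores into the 'name value, name value' string format."""
    return ', '.join('{} {}'.format(k, v) for k, v in scores.items())

if __name__ == '__main__':
    for row in (['nuclear', 'election', 'usa', 'baseball'],
                ['football', 'united', 'thriller'],
                ['marvels', 'spiderman', 'hockey']):
        print(row, '->', format_scores(relevance_scores(row)))
```

With a DataFrame like the one in the question, this could be applied as df['word_list'].apply(lambda x: format_scores(relevance_scores(x))). Because each score is relative to its own list, the scores of a row need not sum to 100, which is why the miscellaneous remainder is clamped rather than computed blindly.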