I have two lists of strings (a company name and category names):
a1 = ['MAGIC', 'BUS']
a2 = ['TRANSPORTATION', 'SERVICES', 'GROUP']
I want to compare every word in list 1 with every word in list 2 and get a semantic similarity score for each pair using nltk. I know how to compare each pair manually with the wn.path_similarity(word_1_in_a1, word_1_in_a2) function, but I would like to do this in a for loop.
Here is my script:
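For example, a single manual comparison looks like this (assuming the WordNet corpus is downloaded and taking the first noun sense, '.n.01', of each word):

    from nltk.corpus import wordnet as wn

    magic = wn.synset('magic.n.01')                     # first noun sense of 'MAGIC'
    transportation = wn.synset('transportation.n.01')   # first noun sense of 'TRANSPORTATION'
    print(wn.path_similarity(magic, transportation))    # a score between 0 and 1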
from nltk.corpus import wordnet as wn

# dictionaries holding the synsets and the similarity scores
company_broken_down = {}
category_broken_down = {}
semantic_sim = {}

if len(a1) > len(a2):
    for x in range(len(a1)):
        company_broken_down[x] = wn.synset(a1[x] + '.n.01')
        for y in range(len(a2)):
            category_broken_down[y] = wn.synset(a2[y] + '.n.01')
            semantic_sim[x] = wn.path_similarity(company_broken_down[x], category_broken_down[y])
else:
    for y in range(len(a2)):
        category_broken_down[y] = wn.synset(a2[y] + '.n.01')
        for x in range(len(a1)):
            company_broken_down[x] = wn.synset(a1[x] + '.n.01')
            semantic_sim[y] = wn.path_similarity(company_broken_down[x], category_broken_down[y])
print(semantic_sim)
After running the script above I get {0: 0.14285714285714285, 1: 0.058823529411764705, 2: 0.09090909090909091}, which is the word 'BUS' from list a1 matched against every word in a2. The first word in a1, 'MAGIC', is never used, however.
Does anyone know how to correct my for loop so that it outputs all 6 similarity scores? Many thanks.
Answer 0 (score: 1):
You are overwriting semantic_sim[y]. Try the code below, where semantic_sim ends up with len(a1) * len(a2) entries:
if len(a1) > len(a2):
    for x in range(len(a1)):
        company_broken_down[x] = wn.synset(a1[x] + '.n.01')
        for y in range(len(a2)):
            category_broken_down[y] = wn.synset(a2[y] + '.n.01')
            # one distinct key per (x, y) pair, so no score is overwritten
            semantic_sim[x * len(a2) + y] = wn.path_similarity(company_broken_down[x], category_broken_down[y])
else:
    for y in range(len(a2)):
        category_broken_down[y] = wn.synset(a2[y] + '.n.01')
        for x in range(len(a1)):
            company_broken_down[x] = wn.synset(a1[x] + '.n.01')
            semantic_sim[y * len(a1) + x] = wn.path_similarity(company_broken_down[x], category_broken_down[y])
print(semantic_sim)
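For what it's worth, here is a sketch of the same all-pairs comparison without manual index bookkeeping, using itertools.product and keying the result by the word pair itself (this assumes the WordNet corpus is available and that every word in both lists has a first noun sense, '<word>.n.01'):

    from itertools import product
    from nltk.corpus import wordnet as wn

    a1 = ['MAGIC', 'BUS']
    a2 = ['TRANSPORTATION', 'SERVICES', 'GROUP']

    semantic_sim = {}
    for w1, w2 in product(a1, a2):                   # all 6 (a1, a2) combinations
        s1 = wn.synset(w1 + '.n.01')                 # first noun sense of the company word
        s2 = wn.synset(w2 + '.n.01')                 # first noun sense of the category word
        semantic_sim[(w1, w2)] = wn.path_similarity(s1, s2)

    print(semantic_sim)                              # {('MAGIC', 'TRANSPORTATION'): ..., ...}

Keying by the word pair also makes the output easier to read back than a flat integer index.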