Measuring similarity between synset lists of different lengths returns nan

Asked: 2017-08-31 05:10:13

Tags: python list for-loop nltk wordnet

I have two sets of WordNet synsets (held in two separate list objects, s1 and s2), and for each synset in the first set (s1) I want to find the highest path-similarity score against the second set (s2), so that the output has the same length as s1. For example, if s1 contains 4 synsets, the output should have length 4. Conversely, when s2 goes into the function first (i.e. s1 and s2 swap positions), the output length should equal the length of s2.

Here is the code I have tried so far.



import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

#two wordnet synsets (s1, s2)

s1 = [wn.synset('multiple_sclerosis.n.01'),
 wn.synset('stewart.n.01'),
 wn.synset('head.n.04'),
 wn.synset('executive.n.01'),
 wn.synset('washington.n.02'),
 wn.synset('not.r.01'),
 wn.synset('expect.v.01'),
 wn.synset('attend.v.01')]
 
s2 = [wn.synset('multiple_sclerosis.n.01'),
 wn.synset('stewart.n.01'),
 wn.synset('sixty-one.s.01'),
 wn.synset('information_technology.n.01'),
 wn.synset('head.n.04'),
 wn.synset('executive.n.01'),
 wn.synset('military_officer.n.01'),
 wn.synset('president.n.04'),
 wn.synset('make.v.01'),
 wn.synset('not.r.01'),
 wn.synset('attend.v.01')]
 
# define a function to find the highest path similarity score for each synset in s1 onto s2, with the length of output equal that of s1

ps_list = []
def similarity_score(s1, s2):
    for word1 in s1:
        best = max(wn.path_similarity(word1, word2) for word2 in s2)
        ps_list.append(best)
    return ps_list

similarity_score(s1, s2)  # this one works fine

similarity_score(s2, s1)  # this one returns a nan




However, as the last line of my code shows, the function returns nan when s2 (which contains 11 synsets) goes into the function first. I cannot figure out what is causing the problem. I suspect it is because I am dealing with synset lists of different lengths, and some synsets in the longer list find no match at all, hence the nan; or perhaps there is something wrong with my for loop.
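As far as I can tell, the nan only appears once the scores reach NumPy/pandas: wn.path_similarity can return None (rather than 0) for a pair of synsets with no connecting path, and a None in a float array is displayed as nan. A minimal sketch of that promotion, using placeholder scores rather than real WordNet output:

```python
import numpy as np

# Placeholder pairwise scores; path_similarity can return None
# when two synsets share no connecting path.
scores = [None, 0.25, 1.0]

# Once None enters a float array, NumPy renders it as nan.
print(np.array(scores, dtype=float))  # [ nan 0.25 1.  ]
```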

I would really appreciate it if someone could clarify this for me and suggest an alternative solution (one that does return a number, a float), so that I can apply this function to other synset lists.

Thanks.

1 Answer:

Answer 0 (score: 1)

There is not much wrong with your code, apart from the use of the ps_list variable (which is not cleared between calls to similarity_score()).

If we change ps_list to a dictionary, we can inspect the best score for each word. The following code does that as a test:

import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
from pprint import pprint

nltk.download("wordnet", "C:/Users/MackayA/Documents/Visual Studio 2017/Projects/PythonApplication9/PythonApplication9/nltk_data")
nltk.data.path.append("C:/Users/MackayA/Documents/Visual Studio 2017/Projects/PythonApplication9/PythonApplication9/nltk_data")

#two wordnet synsets (s1, s2)

s1 = [wn.synset('multiple_sclerosis.n.01'),
 wn.synset('stewart.n.01'),
 wn.synset('head.n.04'),
 wn.synset('executive.n.01'),
 wn.synset('washington.n.02'),
 wn.synset('not.r.01'),
 wn.synset('expect.v.01'),
 wn.synset('attend.v.01')]

s2 = [wn.synset('multiple_sclerosis.n.01'),
 wn.synset('stewart.n.01'),
 wn.synset('sixty-one.s.01'),
 wn.synset('information_technology.n.01'),
 wn.synset('head.n.04'),
 wn.synset('executive.n.01'),
 wn.synset('military_officer.n.01'),
 wn.synset('president.n.04'),
 wn.synset('make.v.01'),
 wn.synset('not.r.01'),
 wn.synset('attend.v.01')]

def similarity_score(set1, set2):
    ps_list = {}
    for word1 in set1:
        best = max(wn.path_similarity(word1, word2) for word2 in set2)
        ps_list[word1] = best
    return ps_list

pprint(similarity_score(s2, s1))

This gives the following result:

{Synset('attend.v.01'): 1.0,
 Synset('executive.n.01'): 1.0,
 Synset('head.n.04'): 1.0,
 Synset('information_technology.n.01'): 0.07142857142857142,
 Synset('make.v.01'): 0.3333333333333333,
 Synset('military_officer.n.01'): 0.14285714285714285,
 Synset('multiple_sclerosis.n.01'): 1.0,
 Synset('not.r.01'): 1.0,
 Synset('president.n.04'): 0.25,
 Synset('sixty-one.s.01'): None,
 Synset('stewart.n.01'): 1.0}

This seems to show that the algorithm cannot find a match of any kind for s2's 'sixty-one.s.01' anywhere in s1. It may be worth running further tests on that single entry.
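If you need a function that always returns floats, one option (a sketch, not the only way) is to coerce those None scores to 0.0 before taking the maximum. The sim parameter below is my own addition so the logic can be exercised without a WordNet download; by default it falls back to wn.path_similarity:

```python
def similarity_score(set1, set2, sim=None):
    """Best similarity of each synset in set1 against set2, as floats."""
    if sim is None:
        # Deferred import so the function itself needs no NLTK at test time.
        from nltk.corpus import wordnet as wn
        sim = wn.path_similarity
    # `or 0.0` maps None (no connecting path) to 0.0, so max() always
    # compares numbers and every entry in the result is a float.
    return [max((sim(w1, w2) or 0.0) for w2 in set2) for w1 in set1]
```

With s2 passed in first, 'sixty-one.s.01' would then score 0.0 instead of None, and the output still has length len(set1).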

Hope this helps.