I have two lists of synsets generated from wordnet.synsets():
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
# convert a Penn Treebank tag to the one used by WordNet
def convert_tag(tag):
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None

# define a function to find the first synset for each token in a document
def doc_to_synsets(doc):
    tokens = nltk.word_tokenize(doc)
    tags = nltk.pos_tag(tokens)
    # look up synsets per token, using that token's own converted POS tag
    syns = [wn.synsets(token, convert_tag(tag)) for token, tag in tags]
    syns_list = [s[0] for s in syns if s]
    return syns_list
# convert two example text documents
doc1 = 'This is a test function.'
doc2 = 'Use this function to check if the code in doc_to_synsets is correct!'
s1 = doc_to_synsets(doc1)
s2 = doc_to_synsets(doc2)

I am trying to write a function that, for each synset in s1, finds the synset in s2 with the maximum path similarity score. So for an s1 containing 4 unique synsets, the function should return 4 path similarity scores, which I will then convert into a pandas Series object for easier computation.
So far I have been working with the following code:
def similarity_score(s1, s2):
    list = []
    for word1 in s1:
        best = max(wn.path_similarity(word1, word2) for word2 in s2)
        list.append(best)
    return list

However, it just returns an empty list without any values:
[]
Would anyone care to point out what is wrong with my for loop, or perhaps shed some light on this problem?
Thanks.
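(One thing worth checking, whether or not it is the cause here: wn.path_similarity returns None when two synsets have no connecting path, and in Python 3 max() raises a TypeError when None is mixed in with floats instead of returning a result. A minimal illustration with made-up placeholder scores, not real WordNet output:)

```python
# placeholder similarity scores; None stands in for wn.path_similarity
# returning None when two synsets share no path (hypothetical values)
scores = [0.25, None, 0.1]

try:
    best = max(scores)  # TypeError in Python 3: cannot compare None and float
except TypeError:
    # filter out None before taking the max
    best = max(s for s in scores if s is not None)

print(best)  # 0.25
```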
Answer 0 (score: 0)
I removed the "Sysnet" class reference, since I don't have that class and it isn't relevant for scoring purposes. The score function is abstracted out, so you can define it however you like. I went with a very simple rule: it compares the parts at each position delimited by the '.' separator and checks whether they are equal; if they are, the score is incremented. For example, compared against be.v.01, be.f.02 in s1 scores 1, because the prefixes match. If we compare it against be.v.02 instead, we get a score of 2, and so on.
s1 = [('be.v.01'),
      ('angstrom.n.01'),
      ('function.n.01'),
      ('trial.n.02')]

s2 = [('use.n.01'),
      ('function.n.01'),
      ('see.n.01'),
      ('code.n.01'),
      ('inch.n.01'),
      ('be.v.01'),
      ('correct.v.01')]
def score(s1, s2):
    score = 0
    for x, y in zip(s1.split('.'), s2.split('.')):
        if x == y:
            score += 1
    return score
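As a quick sanity check, the scoring rule above reproduces the examples from the text (the helper is restated here so the snippet stands alone):

```python
# positional-match scoring rule from the answer above
def score(a, b):
    s = 0
    for x, y in zip(a.split('.'), b.split('.')):
        if x == y:
            s += 1
    return s

print(score('be.f.02', 'be.v.01'))  # 1: only 'be' matches
print(score('be.f.02', 'be.v.02'))  # 2: 'be' and '02' match
print(score('be.v.01', 'be.v.01'))  # 3: all three parts match
```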
closest = []  # list of [target, best_match]
for sysnet1 in s1:
    max_score = 0
    best = None
    for sysnet2 in s2:
        cur_score = score(sysnet1, sysnet2)
        if cur_score > max_score:
            max_score = cur_score
            best = sysnet2
    closest.append([sysnet1, best])
print(closest)
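For reference, a self-contained end-to-end run of the matching loop above (nothing new, just the same pieces restated together) picks one best match per entry of s1. Note that ties keep the first candidate seen, which is why angstrom.n.01 and trial.n.02 both land on use.n.01:

```python
# data from the answer above
s1 = ['be.v.01', 'angstrom.n.01', 'function.n.01', 'trial.n.02']
s2 = ['use.n.01', 'function.n.01', 'see.n.01', 'code.n.01',
      'inch.n.01', 'be.v.01', 'correct.v.01']

def score(a, b):
    # count positions (split on '.') where the two names agree
    s = 0
    for x, y in zip(a.split('.'), b.split('.')):
        if x == y:
            s += 1
    return s

closest = []  # list of [target, best_match]
for name1 in s1:
    max_score = 0
    best = None
    for name2 in s2:
        cur_score = score(name1, name2)
        if cur_score > max_score:
            max_score = cur_score
            best = name2
    closest.append([name1, best])

print(closest)
# [['be.v.01', 'be.v.01'], ['angstrom.n.01', 'use.n.01'],
#  ['function.n.01', 'function.n.01'], ['trial.n.02', 'use.n.01']]
```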