Question

I am using the Euclidean distance example from the book Programming Collective Intelligence:


# Returns a distance-based similarity score for person1 and person2 
def sim_distance(prefs,person1,person2): 
  # Get the list of shared_items 
  si={} 
  for item in prefs[person1]: 
    if item in prefs[person2]: 
       si[item]=1 
  # if they have no ratings in common, return 0 
  if len(si)==0: return 0 
  # Add up the squares of all the differences 
  sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) 
                      for item in prefs[person1] if item in prefs[person2]])

This is the original code for ranking movie critics. I am trying to modify it to find similar posts based on tags. I build a map such as:

url1 - > tag1 tag2
url2 - > tag1 tag3

But if I apply this to the function,

pow(prefs[person1][item]-prefs[person2][item],2)

this becomes 0 because tags don't have weights, same tags have ranking 1. I modified the code to create a difference manually for testing,

pow(prefs[1,2)

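Then I got a lot of 0.5 similarities, but the similarity of a post with itself dropped to 0.3. I can't think of a way to apply Euclidean distance to my situation. Is there one?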
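
To make the problem concrete, here is a minimal sketch of why the shared-item sum collapses to zero. It assumes each tag is stored with a constant weight of 1; the `prefs` layout and the helper name `shared_sum_of_squares` are illustrative and not taken from the book.

```python
# Hypothetical data: every tag a post carries gets the same weight, 1.
prefs = {
    "url1": {"tag1": 1, "tag2": 1},
    "url2": {"tag1": 1, "tag3": 1},
}

def shared_sum_of_squares(prefs, a, b):
    # Same sum as in sim_distance: only items present in both posts count.
    return sum((prefs[a][item] - prefs[b][item]) ** 2
               for item in prefs[a] if item in prefs[b])

# The only shared tag is tag1, and 1 - 1 == 0, so the sum is 0.
print(shared_sum_of_squares(prefs, "url1", "url2"))  # 0
```

Every shared tag contributes (1 - 1) squared, so the sum is 0 for any pair of posts and cannot tell them apart.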

Answer 1

OK, first off, your code looks incomplete: I see only one return from your function. I think you mean something like this:

def sim_distance(prefs, person1, person2): 
  # Get the list of shared_items
  p1, p2 = prefs[person1], prefs[person2]
  si = set(p1).intersection(set(p2))

  # Add up the squares of all the differences 
  matches = (p1[item] - p2[item] for item in si)
  return sum(a * a for a in matches)

Next, your post needs some editing for clarity. I don't know what this means: "this becomes 0 because tags don't have weights, same tags have ranking 1."

Finally, it would help if you provided sample data for prefs[person1] and prefs[person2]. Then you could say what you are getting and what you expect to get.

Edit: Based on my comment below, I would use code like this:

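def sim_distance(prefs, person1, person2):
    # Shared tags divided by all distinct tags (the Jaccard index)
    p1, p2 = prefs[person1], prefs[person2]
    s, t = set(p1), set(p2)
    # float() keeps the division from truncating to 0 under Python 2
    return float(len(s.intersection(t))) / len(s.union(t))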
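
As a quick check of the set-based version, the following sketch runs the same ratio on the two example posts from the question. The `prefs` dictionary, the name `jaccard_similarity`, and the choice to return 0.0 for two untagged posts are assumptions made for illustration.

```python
def jaccard_similarity(prefs, post1, post2):
    # Ratio of shared tags to all distinct tags across both posts.
    s, t = set(prefs[post1]), set(prefs[post2])
    union = s | t
    if not union:
        return 0.0  # assumption: two untagged posts count as dissimilar
    return len(s & t) / float(len(union))

# Hypothetical tag map mirroring the question's url1/url2 example.
prefs = {
    "url1": ["tag1", "tag2"],
    "url2": ["tag1", "tag3"],
}

print(jaccard_similarity(prefs, "url1", "url2"))  # 1 shared / 3 distinct
print(jaccard_similarity(prefs, "url1", "url1"))  # 1.0
```

A post compared with itself scores 1.0 here, which avoids the drop to 0.3 described in the question.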

Answer 2

Basically, tags have no weight and cannot be represented by numerical values, so you cannot define a distance between two tags.

If you want to find the similarity between two posts using their tags, I suggest using the ratio of similar tags. For example, if you have

url1 -> tag1 tag2 tag3 tag4
url2 -> tag1 tag4 tag5 tag6

then you have 2 similar tags, which gives 2 (similar tags) / 4 (total tags) = 0.5. I think this is a good measure of similarity, as long as each post has more than 2 tags.
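
The ratio above could be sketched as follows. The function name `tag_overlap_ratio` is hypothetical, and the answer does not say what to divide by when the two posts carry different numbers of tags; this sketch assumes the larger tag count as the denominator.

```python
def tag_overlap_ratio(tags_a, tags_b):
    # Shared tags divided by the tag count of the larger post.
    # Assumption: the answer's example has 4 tags per post; for unequal
    # counts this sketch picks the larger one as the denominator.
    a, b = set(tags_a), set(tags_b)
    largest = max(len(a), len(b))
    if largest == 0:
        return 0.0
    return len(a & b) / float(largest)

url1 = ["tag1", "tag2", "tag3", "tag4"]
url2 = ["tag1", "tag4", "tag5", "tag6"]
print(tag_overlap_ratio(url1, url2))  # 0.5
```

Dividing by the number of distinct tags across both posts instead (6 here) would give 2 / 6, which is the Jaccard index; the two measures rank pairs similarly but scale differently.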

Euclidean distance between posts based on tags
