Code:
# nlp is assumed to be an already loaded spaCy English pipeline
doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
print(doc[0], doc[2], doc[6], doc[8])
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]
print(apples.similarity(oranges))
print(boots.similarity(hippos))
Output:
Apples oranges Boots hippos
0.0
0.0
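The spaCy documentation says that the more similar two tokens are, the higher the value similarity() returns, yet the similarity of apples and oranges is 0.0. Why?
Well, the code below shows why the similarity comes out wrong. It is caused by incorrect vector computation: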
doc = nlp(u'apples is apple. orange is not. oranges is nothing')
def dot_prd(a, b):
    # Despite the name, this returns the cosine similarity of a and b.
    ans = 0
    sa, sb = 0, 0
    for i in range(len(a)):
        ans += a[i]*b[i]
        sa += a[i]*a[i]
        sb += b[i]*b[i]
    sa = sa**0.5
    sb = sb**0.5
    return ans/(sa*sb)  # 0/0 on an all-zero vector, which NumPy floats report as nan
print(doc[0], doc[2], doc[4], doc[8])
print(dot_prd(doc[0].vector, doc[2].vector), dot_prd(doc[0].vector, doc[4].vector), dot_prd(doc[0].vector, doc[8].vector), dot_prd(doc[4].vector, doc[8].vector))
print(doc[0].similarity(doc[2]), doc[0].similarity(doc[4]), doc[0].similarity(doc[8]), doc[4].similarity(doc[8]))
Output:
apples apple orange oranges
0.750411317806 0.51238496547 nan nan # results of cosine similarity (dot_prd)
0.750411349583 0.512384940626 0.0 0.0 # token.similarity()
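doc[8].vector is all zeros. So why is the vector for the token 'oranges' computed as all zeros?
The vectors for 'orange' and 'apple' are computed correctly. More importantly, 'apples' is computed correctly too. So why is 'oranges' a problem?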
Answer 0 (score: 1)
Because the word vectors of two of the tokens ("oranges" and "hippos") are all zeros (this is a model issue).
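An all-zero vector would also account for the nan versus 0.0 difference in the question: a plain cosine formula divides by a zero norm, while the 0.0 from similarity() is consistent with a guard against that case. Here is a minimal sketch of the idea in pure Python, with no spaCy involved. The function name `cosine` and its `on_zero` parameter are hypothetical, made up for illustration, and plain Python floats raise ZeroDivisionError on 0/0 (NumPy floats give nan instead), so the sketch returns nan explicitly:

```python
import math

def cosine(a, b, on_zero=float("nan")):
    """Cosine similarity of two equal-length vectors.

    on_zero (hypothetical parameter) is returned when either vector
    has zero norm, where the formula would otherwise divide by zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return on_zero
    return dot / (norm_a * norm_b)

apples = [1.0, 2.0, 2.0]    # made-up stand-in for a real word vector
oranges = [0.0, 0.0, 0.0]   # stand-in for a token the model has no vector for

print(cosine(apples, apples))                 # identical vectors
print(cosine(apples, oranges))                # unguarded result: nan
print(cosine(apples, oranges, on_zero=0.0))   # guarded result: 0.0
```

So a 0.0 here probably does not mean "measured and found dissimilar". It more likely means there was nothing to measure.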
You can check this by printing the vectors of those tokens:
print(oranges.vector)
print(hippos.vector)
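If you would rather not eyeball a few hundred floats, something like the following could list the affected tokens for you. It is only a sketch: `tokens_without_vectors` and `FakeToken` are hypothetical names, and the helper assumes nothing beyond each token having `.text` and `.vector` attributes, which spaCy tokens do. spaCy tokens also expose `has_vector` and `vector_norm`, which you could check directly instead.

```python
from collections import namedtuple

def tokens_without_vectors(tokens):
    """Return the text of every token whose vector is all zeros."""
    return [t.text for t in tokens if not any(v != 0 for v in t.vector)]

# Stand-in for real tokens so the sketch runs without a spaCy model.
FakeToken = namedtuple("FakeToken", ["text", "vector"])
sample = [
    FakeToken("apples", [0.1, -0.3]),   # made-up values
    FakeToken("oranges", [0.0, 0.0]),   # all zeros, as in the question
]

print(tokens_without_vectors(sample))  # ['oranges']
```

With a real pipeline you would pass the doc itself, e.g. tokens_without_vectors(doc).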