spaCy: token.vector is computed incorrectly

Time: 2017-05-30 13:08:49

Tags: python nlp spacy

Code:

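import spacy

# Assumption: the question does not say which model was loaded;
# 'en' was the usual shortcut name in 2017.
nlp = spacy.load('en')

doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
print(doc[0], doc[2], doc[6], doc[8])
apples = doc[0]
oranges = doc[2]
boots = doc[6]
hippos = doc[8]
print(apples.similarity(oranges))
print(boots.similarity(hippos))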

Result:

Apples oranges Boots hippos
0.0
0.0

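The code is from here.

I have also opened a GitHub issue.

The spaCy documentation says that the more similar two tokens are, the higher the returned value, yet the similarity of "apples" and "oranges" is 0. Why?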

Edit

Well, the code below shows why the similarity is computed incorrectly. It is caused by an incorrect vector computation:

doc = nlp(u'apples is apple. orange is not. oranges is nothing')

def dot_prd(a, b):
    # Despite the name, this is cosine similarity: dot(a, b) / (|a| * |b|).
    ans = 0
    sa, sb = 0, 0
    for i in range(len(a)):
        ans += a[i]*b[i]
        sa += a[i]*a[i]
        sb += b[i]*b[i]
    sa = sa**0.5
    sb = sb**0.5
    # If either vector is all zeros this is 0/0, which is nan for NumPy floats.
    return ans/(sa*sb)

print(doc[0], doc[2], doc[4], doc[8])

# Cosine similarity computed by hand
print(dot_prd(doc[0].vector, doc[2].vector), dot_prd(doc[0].vector, doc[4].vector),
      dot_prd(doc[0].vector, doc[8].vector), dot_prd(doc[4].vector, doc[8].vector))

# Similarity as reported by spaCy
print(doc[0].similarity(doc[2]), doc[0].similarity(doc[4]),
      doc[0].similarity(doc[8]), doc[4].similarity(doc[8]))

Output:

apples apple orange oranges
0.750411317806 0.51238496547 nan nan   # results of the hand-computed cosine similarity
0.750411349583 0.512384940626 0.0 0.0  # token.similarity()

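doc[8].vector is all zeros. So why is the vector for the 'oranges' token computed as all zeros? The vectors for 'orange' and 'apples' are computed correctly, and so is the vector for 'apple'. So why is 'oranges' a problem?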

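1 answer:

Answer 0 (score: 1)

Because the word vectors of two of the tokens ("oranges" and "hippos") are all zeros (this is a model problem).

You can check this by printing the vectors of these tokens: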

print(oranges.vector)
print(hippos.vector)
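The zero vectors also explain the difference between the two output rows in the question: the hand-written cosine divides 0 by 0 and yields nan, while token.similarity() reports 0.0. Below is a minimal sketch of a cosine similarity with a zero-norm guard that reproduces the 0.0 result. It uses plain Python lists as stand-ins for token vectors; the function name and the sample vectors are made up for illustration, and the guard is an assumption about how such a 0.0 can arise, not a copy of spaCy's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard: a token without a vector has norm 0, so avoid dividing 0 by 0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins, not real spaCy vectors.
known = [1.0, 2.0, 3.0]    # a token the model has a vector for
missing = [0.0, 0.0, 0.0]  # a token the model has no vector for

print(cosine_similarity(known, known))
print(cosine_similarity(known, missing))
```

With a real pipeline, token.has_vector and token.vector_norm should show the same thing without printing the whole vector: a token with no vector in the model has a norm of 0.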