Question

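As a way of getting familiar with TensorFlow, I am trying to verify that the word embeddings produced by word2vec_basic.py (see the tutorial) make sense when checked against human similarity scores. However, the results are surprisingly disappointing. Here is what I did.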

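In word2vec_basic.py, I added another step at the end that saves the embeddings and the reverse dictionary to disk (so I don't have to regenerate them every time):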

with open("embeddings", 'wb') as f:
    np.save(f, final_embeddings)
with open("reverse_dictionary", 'wb') as f:
    pickle.dump(reverse_dictionary, f, pickle.HIGHEST_PROTOCOL)

In my own word2vec_test.py, I load them and build a forward dictionary (word to index) for lookups:

with open("embeddings", 'rb') as f:
    embeddings = np.load(f)
with open("reverse_dictionary", 'rb') as f:
    reverse_dictionary = pickle.load(f)
dictionary = dict(zip(reverse_dictionary.values(), reverse_dictionary.keys()))

Then I define similarity as the Euclidean distance between the embedding vectors:

def distance(w1, w2):
    try:
        return np.linalg.norm(embeddings[dictionary[w1]] - embeddings[dictionary[w2]])
    except KeyError:
        return None  # no such word in our dictionary

So far the results make sense; for example, distance('before', 'after') is smaller than distance('before', 'into').

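Then I downloaded the human scores from http://alfonseca.org/pubs/ws353simrel.tar.gz (I borrowed the link and the code from the Swivel project in the "model zoo"). I compare the human similarity scores with the embedding distances as follows: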

acts = []   # human scores
preds = []  # negated embedding distances
with open("wordsim353_sim_rel/wordsim_relatedness_goldstandard.txt", 'r') as lines:
  for line in lines:
    w1, w2, act = line.strip().split('\t')
    pred = distance(w1, w2)
    if pred is None:
      continue

    acts.append(float(act))
    preds.append(-pred)

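I use -pred because the human scores increase with similarity, so the distance ordering needs to be reversed to match (a smaller distance means greater similarity).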

Then I compute the correlation coefficient:

import scipy.stats

rho, _ = scipy.stats.spearmanr(acts, preds)
print(rho)
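
For readers who want to see what the Spearman coefficient measures here, the following is a minimal pure-Python sketch: rank both lists (ties get the average rank), then take the Pearson correlation of the ranks. The helper names `average_ranks`, `pearson`, and `spearman` are made up for illustration and are not part of scipy; `scipy.stats.spearmanr` remains the practical choice.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only ranks matter, negating the distances flips the sign of the coefficient but not its magnitude, so a value near zero cannot be explained by a sign mistake alone.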

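But the result is tiny, around 0.006. I retrained word2vec_basic with a context of 4 words and a vector length of 256, but it did not improve at all. Then I used cosine similarity instead of Euclidean distance: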

def distance(w1, w2):
    if w1 not in dictionary or w2 not in dictionary:
        return None  # no such word in our dictionary
    return scipy.spatial.distance.cosine(embeddings[dictionary[w1]], embeddings[dictionary[w2]])

Still no correlation.
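
One possible reason the switch to cosine distance changes nothing: if the saved vectors have unit length, then the squared Euclidean distance is exactly twice the cosine distance, so both measures rank word pairs identically and give the same Spearman coefficient. My reading is that word2vec_basic.py exports L2-normalized embeddings as final_embeddings, but treat that as an assumption and check it with np.linalg.norm(embeddings, axis=1). A small sketch of the identity, with helper names chosen only for illustration:

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity, the same quantity scipy.spatial.distance.cosine returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])

# For unit vectors: ||a - b||^2 = 2 - 2 * (a . b) = 2 * cosine_distance(a, b)
squared_euclidean = euclidean(a, b) ** 2
twice_cosine = 2.0 * cosine_distance(a, b)
```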

So, what am I misunderstanding or doing wrong?

Answer 1

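Answering my own question: yes, the results are disappointing, but that is because the model is too small and trained on too little data. It is as simple as that. The implementation I experimented with uses a corpus of 17M words, runs for 100K steps, takes only 2 adjacent context words, and has an embedding size of 128. I took a larger Wikipedia sample of 124M words, increased the context to 24 words (12 on each side), set the embedding size to 256, trained for 1.8M steps, and voila! The correlation (measured as in my question above) grew to 0.24.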
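
To make the context-size numbers concrete, here is a minimal sketch of how skip-gram training pairs grow with the window. It assumes a symmetric window of `window` words on each side, so "12 on each side" would be window=12 and "2 adjacent context words" would be window=1. The function name `skipgram_pairs` is hypothetical; the tutorial script samples a subset of the window per center word rather than emitting every pair, so this is an illustration, not its batching code.

```python
def skipgram_pairs(tokens, window):
    """Return every (center, context) pair within `window` positions of the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox".split()
narrow = skipgram_pairs(tokens, window=1)
wide = skipgram_pairs(tokens, window=2)
```

A wider window yields more pairs per word and pulls in more distant, topical context.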

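Then I implemented subsampling of frequent words as described in this tutorial, and the correlation jumped further to 0.33. Finally, I left my laptop training overnight with a context of 36 words and 3.2M steps, and it went all the way to 0.42! I think we can call this a success.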
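
The subsampling step can be sketched roughly as follows. This uses the rule from the original word2vec paper, where a word with relative frequency f is kept with probability sqrt(t / f), capped at 1. The threshold default t = 1e-5, the function names, and the fixed seed are assumptions for illustration; the tutorial referenced above may use a different variant.

```python
import math
import random
from collections import Counter


def keep_probability(count, total, threshold=1e-5):
    """Probability of keeping one occurrence of a word seen `count` times in `total` tokens."""
    freq = count / total
    return min(1.0, math.sqrt(threshold / freq))


def subsample(tokens, threshold=1e-5, seed=0):
    """Randomly drop occurrences of frequent words; rare words are mostly kept."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    counts = Counter(tokens)
    total = len(tokens)
    return [
        w for w in tokens
        if rng.random() < keep_probability(counts[w], total, threshold)
    ]
```

Words like "the" lose most of their occurrences, which shortens training and leaves more of each context window to informative words.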

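So, for anyone playing with it like me, it looks like a game that requires lots of data, patience, and NVidia hardware (which I don't have yet). But it is still a lot of fun.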

word2vec_basic output: trying to test word similarity against human similarity scores
