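When I use the R package rword2vec to compute word analogies on a text corpus, I get great, intuitive answers. However, when I export the word vectors with bin_to_txt, read them back in, and compute the analogies with the text2vec package, I get very low-quality answers.
Is there a difference in how the two packages compute analogies, and if so, how can I reproduce rword2vec's calculation without that package? Unfortunately, I can't install rword2vec on the machine I need to use in production.
In the example below, I use the analogy king:queen :: man:_. With the same vectors, rword2vec gives me "woman", "girl", and "blonde", while the text2vec approach using sim2 gives me "man", "ahab", and "king". What am I doing wrong in the latter approach?
Code to reproduce: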
# Text file: http://mattmahoney.net/dc/text8.zip
# USING rword2vec
# require(devtools)
# install_github("mukul13/rword2vec")
require(rword2vec)
file1 <- "C:\\Users\\bdk\\Downloads\\text8"
model <- word2vec(train_file = file1, output_file = "vec_repro.bin", binary = 1)
anal1 <- word_analogy(file_name = "vec_repro.bin", search_words = "king queen man", num = 20)
print(anal1)
# first 10 results:
#        word              dist
# 1     woman 0.718326687812805
# 2      girl 0.607264637947083
# 3    blonde  0.56832879781723
# 4      lady 0.518971383571625
# 5    lovely 0.515585899353027
# 6  stranger 0.504840195178986
# 7    ladies 0.500177025794983
# 8    totoro 0.497228592634201
# 9      baby 0.497049778699875
# 10 handsome 0.490864992141724
bin_to_txt("vec_repro.bin","vector_repro.txt")
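# Read the exported word vectors back in and repeat the calculation with text2vec
require(text2vec)
# skip = 1 skips the first line of the exported file; read.table already returns a data frame
data1 <- read.table("vector_repro.txt", skip = 1)
vocab_all2 <- data1
# Use the word column (V1) as row names, then drop it so only numeric columns remain
rownames(vocab_all2) <- vocab_all2[, 1]
vocab_all2$V1 <- NULL
colnames(vocab_all2) <- NULL
# Drop any rows with missing values and convert to a numeric matrix
vocab_all2 <- vocab_all2[complete.cases(vocab_all2), ]
vocab_all3 <- data.matrix(vocab_all2)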
guess1 <- (
vocab_all3["king", , drop = FALSE] -
vocab_all3["queen", , drop = FALSE] +
vocab_all3["man", , drop = FALSE]
)
cos_sim <- sim2(x = vocab_all3, y = guess1, method = "cosine", norm = "l2")
head(sort(cos_sim[,1], decreasing = TRUE), 10)
# first 10 results:
#       man      ahab      king       god   prophet   faramir   saladin      saul      enki   usurper
# 0.7212826 0.4715135 0.4696279 0.4625656 0.4522798 0.4391127 0.4358722 0.4326022 0.4310836 0.4300992
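
For comparison, here is a minimal base-R sketch of the analogy computation as the original word2vec `word-analogy` tool performs it. This rests on the assumption that rword2vec's word_analogy wraps that tool unchanged, which I have not verified against the rword2vec source. In that tool, the query "king queen man" builds the target as queen - king + man (the opposite sign from the king - queen + man used above), after scaling every vector to unit length, and the three query words are excluded from the results. The function name `analogy_sketch` and the toy matrix are hypothetical and exist only for illustration.

```r
# Hypothetical helper; not part of rword2vec or text2vec.
# `vectors` is assumed to be a numeric matrix with one word per row
# and the words as row names (like vocab_all3 above).
analogy_sketch <- function(vectors, word_a, word_b, word_c, num = 10) {
  # Scale every word vector to unit L2 length before any arithmetic
  unit <- vectors / sqrt(rowSums(vectors^2))
  # Query "a b c" (a is to b as c is to ?): target = b - a + c
  target <- unit[word_b, ] - unit[word_a, ] + unit[word_c, ]
  target <- target / sqrt(sum(target^2))
  # Both sides are unit length, so the dot product is the cosine similarity
  scores <- drop(unit %*% target)
  # Exclude the three query words from the candidates
  scores <- scores[!(names(scores) %in% c(word_a, word_b, word_c))]
  head(sort(scores, decreasing = TRUE), num)
}

# Toy 3-dimensional vectors, made up purely to exercise the function
toy <- rbind(
  king  = c(1, 1, 0),
  queen = c(1, 0, 1),
  man   = c(0, 1, 0),
  woman = c(0, 0, 1),
  apple = c(1, 0, 0)
)
print(analogy_sketch(toy, "king", "queen", "man", num = 2))
```

On the toy data the top-ranked word is "woman". With the real vectors the call would be `analogy_sketch(vocab_all3, "king", "queen", "man", num = 20)`; whether that matches rword2vec's output exactly depends on the assumption stated above.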