Question

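I am trying to replicate Arora et al. 2017 (https://github.com/PrincetonML/SIF / https://openreview.net/forum?id=SyK00v5xx) using text2vec. The authors compute sentence embeddings by averaging word embeddings and subtracting the first principal component.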

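Thanks to the author of text2vec, I can compute the GloVe embeddings and average them. The next step is to compute the principal components / SVD and subtract the first component from the embeddings.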

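I can compute the SVD with the irlba package (which I believe is also used in text2vec), but I am stuck on how to remove the principal component from the averaged word embeddings.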

The Python code from the paper (https://github.com/PrincetonML/SIF/blob/master/src/SIF_embedding.py) has the following function:

def remove_pc(X, npc=1):
    """
    Remove the projection on the principal components
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: XX[i, :] is the data point after removing its projection
    """
    pc = compute_pc(X, npc)
    if npc==1:
        XX = X - X.dot(pc.transpose()) * pc
    else:
        XX = X - X.dot(pc.transpose()).dot(pc)
    return XX
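
In matrix terms, the function subtracts from each row its projection onto the principal directions. The notation below is a sketch that assumes compute_pc (not shown in this excerpt) returns those directions as orthonormal rows of a matrix P; the symbols P and v are introduced here for illustration only.

```latex
% X : n x d matrix of sentence embeddings (one row per sentence)
% P : npc x d matrix whose rows are the principal directions (assumed orthonormal)
\[
  XX = X - X P^{\top} P
\]
% Special case npc = 1, where P = v^{\top} for a unit vector v in R^d:
\[
  XX = X - (X v)\, v^{\top}
\]
```

With 1K sentences and 300 dimensions, X v is 1K x 1 and v^T is 1 x 300, so the subtracted term has the same 1K x 300 shape as X.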

My R code is:

# get the word vectors
wv_context = glove$components
word_vectors = wv_main + t(wv_context)

# create document term matrix
dtm = create_dtm(it, vectorizer)

# assign the word embeddings
common_terms = intersect(colnames(dtm), rownames(word_vectors) )

# normalise
dtm_averaged <-  text2vec::normalize(dtm[, common_terms], "l1")

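For example, if I have 1K sentences x 300 variables and run the irlba function, I get three matrices. These have, for example, 4 components x 1K observations.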

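How do I transform the output of this function (1K x number of components) so that I can subtract the component from the sentence embeddings (1K x 300 variables)?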

Thanks!

Answer 1

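The idea is that with a truncated SVD/PCA you can reconstruct the original matrix with minimal squared error. So you obtain the SVD in the form (U, D, V), and the reconstruction of the original matrix is A ~ U * D * t(V). Keeping only the first component gives a rank-1 reconstruction, and we subtract that reconstruction from the original matrix, which is exactly what the authors proposed. Here is an example: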

library(text2vec)
data("movie_review")

it = itoken(movie_review$review, preprocessor = tolower, tokenizer = word_tokenizer)
dtm = create_dtm(it, hash_vectorizer(2**14))

lsa = LSA$new(n_topics = 64)
doc_emb = lsa$fit_transform(dtm)

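# SVD of the document embeddings, keeping only the first left/right singular vectors
doc_emb_svd = svd(doc_emb, nu = 1, nv = 1)
# rank-1 reconstruction from the first component
# (d still holds all singular values, so take d[1])
doc_emb_pc1 = doc_emb_svd$d[1] * doc_emb_svd$u %*% t(doc_emb_svd$v)
doc_emb_minus_pc1 = doc_emb - doc_emb_pc1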
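
Subtracting the rank-1 reconstruction and removing the projection onto the first component, as the paper's remove_pc does, should give the same result, since X v1 = d1 u1 for the first singular triplet. Below is a minimal NumPy sketch of that check on random toy data, not real embeddings. The compute_pc here is a hypothetical stand-in built on np.linalg.svd, not the repository's own helper, and remove_pc uses the general matrix form for every npc.

```python
import numpy as np


def compute_pc(X, npc=1):
    """Hypothetical stand-in for the repository's compute_pc helper.

    Returns the first `npc` right singular vectors of X as rows
    (shape: npc x n_features). X is not centered.
    """
    # Thin SVD: X = U @ diag(d) @ Vt
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:npc]


def remove_pc(X, npc=1):
    """Remove each row's projection onto the first `npc` components."""
    pc = compute_pc(X, npc)
    return X - X @ pc.T @ pc


# Toy data standing in for averaged sentence embeddings
# (200 "sentences" x 50 dimensions, kept small so the check runs quickly).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Route 1: subtract the rank-1 reconstruction d1 * u1 * v1^T,
# mirroring the R snippet above.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
X_minus_rank1 = X - d[0] * np.outer(U[:, 0], Vt[0])

# Route 2: remove the projection onto the first component, as in remove_pc.
X_minus_proj = remove_pc(X, npc=1)

# Both routes should agree up to floating-point error.
same = np.allclose(X_minus_rank1, X_minus_proj)
print(same)
```

In R, the projection form would correspond to multiplying the embeddings by v and then by t(v), which needs only the right singular vectors from irlba or svd.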

If you get a chance to finish the implementation, please consider contributing it to text2vec. Here is the ticket for Arora sentence embeddings: https://github.com/dselivanov/text2vec/issues/157.

Implementing Arora 2017 in text2vec
