Question

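I would have thought that somewhere among the R packages text2vec, tm, quanteda, svs, qlcMatrix and wordspace there would be a function to compute PPMI (positive pointwise mutual information) between terms and contexts, based on a term-term (context) co-occurrence matrix. Apparently there isn't, so I went ahead and wrote one myself. The problem is that it is slow as molasses, probably because I'm not very good with sparse matrices. My tcms are around 10k * 20k, so they really do need to be sparse.

As far as I understand, PMI is defined as: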

PMI = log( p(word, context) / (p(word)*p(context)) )

From this I deduced:

PMI = log( (count(word_context_co-occurrence) / N) / ((count(word)/N) * (count(context)/N)) )

where N is the sum of all co-occurrences in the co-occurrence matrix. PPMI simply forces all values < 0 to 0. (That's correct, right?)

With that in mind, here is my attempt at an implementation:

library(Matrix)
set.seed(1)
pmat = matrix(sample(c(0,0,0,0,0,0,1,10),5*10,T), 5,10, byrow=T) # tiny example matrix;
# rows are words, columns are contexts (words the row-words co-occur with, in a certain window in the text)
pmat = Matrix(pmat, sparse=T) # make it sparse

# calculate some things beforehand to make it faster
N = sum(pmat)
contextp = Matrix::colSums(pmat)/N # probabilities of contexts
wordp = Matrix::rowSums(pmat)/N # probabilities of terms

# here goes nothing...
pmat2 = pmat
for(r in 1:nrow(pmat)){ # go term by term, calculate PPMI association with each of its contexts
  not0 = which(pmat[r, ] > 0) # no need to consider 0 values (no co-occurrence)
  tmp = log( (pmat[r,not0] / N) / (wordp[r] * contextp[not0] )) # PMI
  tmp = ifelse(tmp < 0, 0, tmp) # PPMI
  pmat2[r, not0] = tmp # <-- THIS here is the slow part, replacing the old frequency values with the new PPMI weighted ones.
}
# take a look:
round(pmat2,2)

What seems to be slow is not the calculation itself but putting the newly calculated values into the sparse matrix. On this tiny example it isn't bad, but with thousands of rows by thousands of columns, even a single iteration of this loop takes forever; building a new matrix instead seems like an even worse idea.

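What would be a more efficient way to replace the old values in such a sparse matrix with the new PPMI-weighted values? Suggestions for changing this code, or a pointer to an existing function in some package that I have somehow missed, are both welcome.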

Answer 1

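I figured it out in the meantime, and this approach is quite fast. I'll leave it here in case someone else ends up with the same problem. It also seems to be similar to the approach linked in the comments (thanks!).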

# this is for a column-oriented sparse matrix; transpose if necessary
tcmrs = Matrix::rowSums(pmat)
tcmcs = Matrix::colSums(pmat)
N = sum(tcmrs)
colp = tcmcs/N
rowp = tcmrs/N
pp = pmat@p+1
ip = pmat@i+1
tmpx = rep(0,length(pmat@x)) # new values go here, just a numeric vector
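# iterate through the sparse matrix, one column at a time:
for(i in seq_len(ncol(pmat))){
  if(pp[i+1] == pp[i]) next  # empty column: nothing stored, and pp[i]:(pp[i+1]-1) would count backwards
  ind = pp[i]:(pp[i+1]-1)    # positions in @x of the values stored for column i
  not0 = ip[ind]             # row indices of those values
  icol = pmat@x[ind]         # the co-occurrence counts themselves
  tmp = log( (icol/N) / (rowp[not0] * colp[i] )) # PMI
  tmpx[ind] = tmp
}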
pmat@x = tmpx
# to convert to PPMI, replace <0 values with 0 and do a Matrix::drop0() on the object.
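
For what it's worth, the same idea can probably be written without an explicit loop, since the column index of every stored value can be recovered from the @p slot. The following is only a sketch under a few assumptions: the input is a dgCMatrix (which is what Matrix(x, sparse=TRUE) gives for an ordinary numeric matrix), and ppmi_sketch is a made-up helper name, not a function from any of the packages mentioned in the question. It also includes the final PPMI step (clamping negative values to 0 and dropping them).

```r
library(Matrix)

# Hypothetical helper (not part of any package): PPMI weighting of a
# column-oriented sparse count matrix, assuming class dgCMatrix.
ppmi_sketch = function(m) {
  stopifnot(inherits(m, "dgCMatrix"))
  N = sum(m@x)                              # total number of co-occurrences
  rowp = unname(Matrix::rowSums(m)) / N     # probabilities of terms (rows)
  colp = unname(Matrix::colSums(m)) / N     # probabilities of contexts (columns)
  rows = m@i + 1L                           # 1-based row index of each stored value
  cols = rep(seq_len(ncol(m)), diff(m@p))   # 1-based column index of each stored value
  pmi = log((m@x / N) / (rowp[rows] * colp[cols]))
  m@x = pmax(pmi, 0)                        # PPMI: negative PMI values become 0
  Matrix::drop0(m)                          # remove the zeros that are now stored explicitly
}

# usage on the tiny example matrix from the question
set.seed(1)
pmat = Matrix(matrix(sample(c(0,0,0,0,0,0,1,10), 5*10, TRUE), 5, 10, byrow=TRUE), sparse=TRUE)
ppmi = ppmi_sketch(pmat)
round(ppmi, 2)
```

Because only the @x vector is rewritten and the sparsity pattern is left alone until drop0(), this avoids the slow element-wise assignment into the sparse matrix. If the matrix is stored in another sparse format, it would need to be converted first.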

How to efficiently compute PPMI on a sparse matrix in R?
