How can I efficiently compute PPMI on a sparse matrix in R?

Time: 2017-04-11 19:18:29

Tags: r matrix sparse-matrix information-theory

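I had assumed that somewhere among the R packages text2vec, tm, quanteda, svs, qlcMatrix and wordspace there would be a function to calculate PPMI (positive pointwise mutual information) between terms and contexts (based on a term-term(context) co-occurrence matrix), but apparently not, so I went ahead and wrote one myself. The problem is that it is as slow as molasses, probably because I am not very good with sparse matrices. My TCMs (term co-occurrence matrices) are around 10k * 20k, so they really do need to be sparse.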

As I understand it,

PMI = log( p(word, context) / (p(word) * p(context)) )

from which I deduced:

PMI = log( (count(word_context_co-occurrence) / N) / ((count(word) / N) * (count(context) / N)) )

where N is the sum of all co-occurrences in the co-occurrence matrix. PPMI simply forces all values < 0 to 0. (That's right, isn't it?)
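A small worked example may make the definition concrete. This is only a sketch on made-up 2 x 2 counts (the matrix and the names `counts`, `pmi` and `ppmi` are illustrative assumptions, not part of the question), using a dense base R matrix rather than a sparse one:

```r
# Hypothetical co-occurrence counts: rows are words, columns are contexts
counts <- matrix(c(4, 0,
                   1, 5), nrow = 2, byrow = TRUE,
                 dimnames = list(c("w1", "w2"), c("c1", "c2")))

N <- sum(counts)                    # total number of co-occurrences (10 here)
p_joint   <- counts / N             # p(word, context)
p_word    <- rowSums(counts) / N    # p(word)
p_context <- colSums(counts) / N    # p(context)

# PMI for every cell; cells with a zero count come out as -Inf
pmi <- log(p_joint / outer(p_word, p_context))

# PPMI: clamp everything below 0 (including -Inf) to 0
ppmi <- pmax(pmi, 0)
round(ppmi, 3)
# (w1, c1) is log(2), about 0.693; (w2, c2) is log(5/3), about 0.511; the rest are 0
```

On a dense matrix the zero cells have to be evaluated and then clamped; on a sparse matrix only the stored non-zero values need to be touched, which is the point of the question.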

With that in mind, here is an attempt at an implementation:

library(Matrix)
set.seed(1)
pmat = matrix(sample(c(0,0,0,0,0,0,1,10),5*10,TRUE), 5,10, byrow=TRUE) # tiny example matrix;
# rows are words, columns are contexts (words the row-words co-occur with, in a certain window in the text)
pmat = Matrix(pmat, sparse=TRUE) # make it sparse

# calculate some things beforehand to make it faster
N = sum(pmat)
contextp = Matrix::colSums(pmat)/N # probabilities of contexts
wordp = Matrix::rowSums(pmat)/N # probabilities of terms

# here goes nothing...
pmat2 = pmat
for(r in 1:nrow(pmat)){ # go term by term, calculate PPMI association with each of its contexts
  not0 = which(pmat[r, ] > 0) # no need to consider 0 values (no co-occurrence)
  tmp = log( (pmat[r,not0] / N) / (wordp[r] * contextp[not0] )) # PMI
  tmp = ifelse(tmp < 0, 0, tmp) # PPMI
  pmat2[r, not0] = tmp # <-- THIS here is the slow part, replacing the old frequency values with the new PPMI weighted ones.
}
# take a look:
round(pmat2,2)

What seems to be slow is not the calculation itself but putting the newly calculated values into the sparse matrix. On this tiny example it is not bad, but with thousands by thousands of rows even a single iteration of this loop takes forever; building a new matrix instead seems like an even worse idea.
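For context, it may help to look at how the Matrix package lays out a column-compressed sparse matrix (class `dgCMatrix`, which is what `Matrix(..., sparse = TRUE)` returns for a general numeric matrix like the one here). The sketch below uses a made-up 2 x 3 matrix that is not part of the question:

```r
library(Matrix)

# Hypothetical example: the middle column is entirely zero
m <- Matrix(matrix(c(1, 0, 2,
                     0, 0, 3), nrow = 2, byrow = TRUE), sparse = TRUE)

m@x   # stored non-zero values, column by column:  1 2 3
m@i   # 0-based row index of each stored value:    0 0 1
m@p   # 0-based column pointers:                   0 1 1 3
```

Column j owns the stored values at positions `p[j] + 1` through `p[j + 1]` (1-based); an empty column shows up as two equal consecutive pointers. Since the values are kept column by column in flat vectors, assigning into one row at a time plausibly forces these vectors to be rebuilt on each assignment, which would be consistent with the slowdown described above, though that has not been measured here.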

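What would be a more efficient way to replace the old values in such a sparse matrix with the new PPMI-weighted ones? Suggestions for changing this code, or pointers to an existing function in some package that I have somehow missed, are both welcome.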

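1 answer:

Answer 0: (score: 0)

I figured it out in the meantime, and this approach is quite fast. I'll leave it here in case anyone else runs into the same problem. It also seems to be related to the approach linked in the comments (thanks!).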

# this is for a column-oriented sparse matrix; transpose if necessary
tcmrs = Matrix::rowSums(pmat)
tcmcs = Matrix::colSums(pmat)
N = sum(tcmrs)
colp = tcmcs/N
rowp = tcmrs/N
pp = pmat@p+1
ip = pmat@i+1
tmpx = rep(0,length(pmat@x)) # new values go here, just a numeric vector
# iterate through sparse matrix:
for(i in 1:(length(pmat@p)-1) ){ 
  if(pp[i] == pp[i+1]) next # empty column: pp[i]:(pp[i+1]-1) would count backwards
  ind = pp[i]:(pp[i+1]-1)
  not0 = ip[ind]
  icol = pmat@x[ind]
  tmp = log( (icol/N) / (rowp[not0] * colp[i] )) # PMI
  tmpx[ind] = tmp    
}
pmat@x = tmpx
# to convert to PPMI, replace <0 values with 0 and do a Matrix::drop0() on the object.
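
The loop over columns can probably be avoided altogether, since the column of every stored value can be recovered from the pointer slot in one step. The following is a sketch under the assumption that the input is a `dgCMatrix` with non-negative counts; the function name `ppmi_sparse` is hypothetical and not taken from any package. It also performs the PPMI step mentioned in the last comment above.

```r
library(Matrix)

# Hypothetical helper: PPMI weighting of a column-compressed sparse count matrix
ppmi_sparse <- function(m) {
  stopifnot(methods::is(m, "dgCMatrix"))
  N    <- sum(m@x)                               # total number of co-occurrences
  rowp <- Matrix::rowSums(m) / N                 # p(word)
  colp <- Matrix::colSums(m) / N                 # p(context)
  row_idx <- m@i + 1L                            # 1-based row of each stored value
  col_idx <- rep(seq_len(ncol(m)), diff(m@p))    # column of each stored value
  pmi <- log((m@x / N) / (rowp[row_idx] * colp[col_idx]))
  m@x <- pmax(pmi, 0)                            # PPMI: negative values become 0
  Matrix::drop0(m)                               # drop the explicitly stored zeros
}

# Usage on the example matrix from the question
set.seed(1)
pmat <- Matrix(matrix(sample(c(0,0,0,0,0,0,1,10), 5*10, TRUE), 5, 10, byrow = TRUE),
               sparse = TRUE)
round(ppmi_sparse(pmat), 2)
```

Only the vector of stored values is rewritten, so the work is proportional to the number of non-zero cells rather than to the full dimensions of the matrix.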